Single Instruction, Multiple Data (SIMD) Patents (Class 712/22)
  • Patent number: 9582419
    Abstract: A data processing device 100 comprises a plurality of storage circuits 130, 160, which store a plurality of data elements of b bits in an interleaved manner. The data processing device also comprises a consumer 110 with a number of lanes 120. The consumer is able to individually access each of the plurality of storage circuits 130, 160 in order to receive into the lanes 120 either a subset of the plurality of data elements or y bits of each of the plurality of data elements. The consumer 110 is also able to execute a common instruction on each of the plurality of lanes 120. The relationship between b and y is such that b is greater than y and is an integer multiple of y. Each of the plurality of storage circuits 130, 160 stores at most y bits of each of the data elements. Furthermore, each of the storage circuits 130, 160 stores at most y/b of the plurality of data elements. By carrying out the interleaving in this manner, the plurality of storage circuits 130, 160 comprise no more than b/y storage circuits.
    Type: Grant
    Filed: October 25, 2013
    Date of Patent: February 28, 2017
    Assignee: ARM Limited
    Inventors: Ganesh Suryanarayan Dasika, Rune Holm, Stephen John Hill
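A hedged Python sketch of the bit-slice storage scheme this abstract describes, with assumed concrete values b = 32 and y = 8 (so b/y = 4 banks); all function names and constants here are illustrative, not taken from the patent:

```python
# Assumed parameters: b-bit elements split into y-bit slices across b/y banks.
B, Y = 32, 8
NUM_BANKS = B // Y  # no more than b/y storage circuits

def interleave(elements):
    """Split each b-bit element into y-bit slices; bank k holds slice k of every element."""
    banks = [[] for _ in range(NUM_BANKS)]
    for e in elements:
        for k in range(NUM_BANKS):
            banks[k].append((e >> (k * Y)) & ((1 << Y) - 1))
    return banks

def read_full_element(banks, i):
    """Gather the y-bit slices of element i from every bank and reassemble it."""
    return sum(banks[k][i] << (k * Y) for k in range(NUM_BANKS))

def read_slice(banks, k):
    """Read the k-th y-bit slice of every element from a single bank."""
    return list(banks[k])
```

Each bank stores at most y bits of each element, and a consumer lane can fetch either a whole element (one access per bank) or one slice of every element (one access to one bank), matching the two access modes in the abstract.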
  • Patent number: 9575753
    Abstract: Mechanisms, in a data processing system comprising a single instruction multiple data (SIMD) processor, for performing a data dependency check operation on vector element values of at least two input vector registers are provided. Two calls to a simd-check instruction are performed, one with input vector registers having a first order and one with the input vector registers having a different order. The simd-check instruction performs comparisons to determine if any data dependencies are present. Results of the two calls to the simd-check instruction are obtained and used to determine if any data dependencies are present in the at least two input vector registers. Based on the results, the SIMD processor may perform various operations.
    Type: Grant
    Filed: March 15, 2012
    Date of Patent: February 21, 2017
    Assignee: International Business Machines Corporation
    Inventors: Alexandre E. Eichenberger, Bruce M. Fleischer
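A hedged scalar model of the two-call pattern in the abstract; the exact comparison semantics of the simd-check instruction are not given, so this sketch assumes one plausible form (an element matching an earlier-position element of the other register signals a cross-iteration dependency):

```python
def simd_check(a, b):
    """Hypothetical simd-check: flag each position of `a` that matches an
    earlier-position element of `b`."""
    return [any(a[i] == b[j] for j in range(i)) for i in range(len(a))]

def has_dependency(va, vb):
    # Two calls with the input registers in opposite orders, as described;
    # combining both results covers dependencies in either direction.
    return any(simd_check(va, vb)) or any(simd_check(vb, va))
```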
  • Patent number: 9529571
    Abstract: An apparatus and method for creation of reordered vectors from sequential input data for block based decimation, filtering, interpolation and matrix transposition using a memory circuit for a Single Instruction, Multiple Data (SIMD) Digital Signal Processor (DSP). This memory circuit includes a two-dimensional storage array, a rotate-and-distribute unit, a read controller and a write controller, to map input vectors containing sequential data elements in columns of the two-dimensional array and extract reordered target vectors from this array. The data elements and memory configuration are received from the SIMD DSP.
    Type: Grant
    Filed: October 5, 2011
    Date of Patent: December 27, 2016
    Assignee: TELEFONAKTIEBOLAGET LM ERICSSON (PUBL)
    Inventors: David Van Kampen, Kees Van Berkel, Sven Goossens, Wim Kloosterhuis, Claudiu Zissulescu-Ianculescu
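The transposition use case in the abstract can be sketched as a minimal Python model (the function name and list-of-lists representation are assumptions for illustration): input vectors are written into columns of a two-dimensional array, and reordered target vectors are read back out as rows.

```python
def reorder_transpose(vectors):
    """Write each input vector into a column of a 2-D array, then read rows
    out as the reordered target vectors -- i.e. a matrix transposition."""
    rows = len(vectors[0])
    return [[v[r] for v in vectors] for r in range(rows)]
```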
  • Patent number: 9507637
    Abstract: Disclosed are apparatus and methods for managing thread resources. A computing device can generate threads for an executable application. The computing device can receive an allocation request to allocate thread-specific memory for an executable thread of the threads, where thread-specific memory includes a call stack for the executable thread. In response to the allocation request, the computing device can: allocate the thread-specific memory and indicate that the executable thread is ready for execution. The computing device can execute the executable thread. The computing device can receive a sleep request to suspend executable thread execution. In response to the sleep request, the computing device can determine whether the allocated thread-specific memory is to be deallocated. After determining that the allocated thread-specific memory is to be deallocated: the thread-specific memory can be deallocated and an indication that the executable thread execution is suspended can be provided.
    Type: Grant
    Filed: August 8, 2013
    Date of Patent: November 29, 2016
    Assignee: Google Inc.
    Inventor: Winthrop Lyon Saville, III
  • Patent number: 9501449
    Abstract: An apparatus, computer-readable medium, and computer-implemented method for parallelization of a computer program on a plurality of computing cores includes receiving a computer program comprising a plurality of commands, decomposing the plurality of commands into a plurality of node networks, each node network corresponding to a command in the plurality of commands and including one or more nodes corresponding to execution dependencies of the command, mapping the plurality of node networks to a plurality of systolic arrays, each systolic array comprising a plurality of cells and each non-data node in each node network being mapped to a cell in the plurality of cells, and mapping each cell in each systolic array to a computing core in the plurality of computing cores.
    Type: Grant
    Filed: September 10, 2014
    Date of Patent: November 22, 2016
    Assignee: Sviral, Inc.
    Inventors: Solomon Harsha, Paul Master
  • Patent number: 9495160
    Abstract: Method, apparatus, and program means for performing a string comparison operation. An apparatus includes execution resources to execute a first instruction. In response to the first instruction, said execution resources store a result of a comparison between each data element of a first and second operand corresponding to a first and second text string, respectively.
    Type: Grant
    Filed: December 5, 2014
    Date of Patent: November 15, 2016
    Assignee: Intel Corporation
    Inventors: Michael A. Julier, Jeffrey D. Gray, Srinivas Chennupaty, Sean P. Mirkes, Mark P. Seconi
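A hedged sketch of the comparison result the abstract describes: each data element (character) of the two string operands is compared, and a per-element result is stored. The mask representation below is an assumption; the instruction's actual result format is not specified in the abstract.

```python
def string_compare_mask(op1, op2):
    """Compare corresponding elements of two text-string operands and
    return a per-element equality result."""
    n = max(len(op1), len(op2))
    return [i < len(op1) and i < len(op2) and op1[i] == op2[i]
            for i in range(n)]
```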
  • Patent number: 9489180
    Abstract: Methods, apparatus and computer software product for source code optimization are provided. In an exemplary embodiment, a first custom computing apparatus is used to optimize the execution of source code on a second computing apparatus. In this embodiment, the first custom computing apparatus contains a memory, a storage medium and at least one processor with at least one multi-stage execution unit. The second computing apparatus contains at least one vector execution unit that allows for parallel execution of tasks on constant-strided memory locations. The first custom computing apparatus optimizes the code for parallelism, locality of operations, constant-strided memory accesses and vectorized execution on the second computing apparatus. This Abstract is provided for the sole purpose of complying with the Abstract requirement rules. This Abstract is submitted with the explicit understanding that it will not be used to interpret or to limit the scope or the meaning of the claims.
    Type: Grant
    Filed: November 16, 2012
    Date of Patent: November 8, 2016
    Assignee: Reservoir Labs, Inc.
    Inventors: Muthu Baskaran, Richard A. Lethin, Benoit J. Meister, Nicolas T. Vasilache
  • Patent number: 9424034
    Abstract: A processor includes N-bit registers and a decode unit to receive a multiple register memory access instruction. The multiple register memory access instruction is to indicate a memory location and a register. The processor includes a memory access unit coupled with the decode unit and with the N-bit registers. The memory access unit is to perform a multiple register memory access operation in response to the multiple register memory access instruction. The operation is to involve N-bit data in each of the N-bit registers comprising the indicated register. The operation is also to involve different corresponding N-bit portions of an M×N-bit line of memory corresponding to the indicated memory location. A total number of bits of the N-bit data in the N-bit registers to be involved in the multiple register memory access operation is to amount to at least half of the M×N-bits of the line of memory.
    Type: Grant
    Filed: June 28, 2013
    Date of Patent: August 23, 2016
    Assignee: Intel Corporation
    Inventors: Glenn Hinton, Bret Toll, Ronak Singhal
  • Patent number: 9405539
    Abstract: Methods, apparatus, instructions and logic provide SIMD vector sub-byte decompression functionality. Embodiments include shuffling a first and second byte into the least significant portion of a first vector element, and a third and fourth byte into the most significant portion. Processing continues shuffling a fifth and sixth byte into the least significant portion of a second vector element, and a seventh and eighth byte into the most significant portion. Then by shifting the first vector element by a first shift count and the second vector element by a second shift count, sub-byte elements are aligned to the least significant bits of their respective bytes. Processors then shuffle a byte from each of the shifted vector elements' least significant portions into byte positions of a destination vector element, and from each of the shifted vector elements' most significant portions into byte positions of another destination vector element.
    Type: Grant
    Filed: July 31, 2013
    Date of Patent: August 2, 2016
    Assignee: Intel Corporation
    Inventors: Tal Uliel, Elmoustapha Ould-Ahmed-Vall, Thomas Willhalm, Robert Valentine
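The shuffle/shift/mask sequence in the abstract can be modeled scalar-fashion: for each sub-byte element, gather the bytes that straddle it (the "shuffle" step), shift it down to bit 0, then mask to the element width. This is a hedged sketch assuming LSB-first packing; the width 5 in the test is just an example, not from the patent.

```python
def decompress_sub_byte(packed, width, count):
    """Unpack `count` elements of `width` bits (width < 8) from a
    LSB-first packed byte sequence."""
    out = []
    for i in range(count):
        bit = i * width
        byte, off = bit // 8, bit % 8
        # Gather the (up to) two bytes holding this element.
        pair = packed[byte] | (
            (packed[byte + 1] << 8) if byte + 1 < len(packed) else 0)
        # Shift the element to bit 0, then mask off neighbouring elements.
        out.append((pair >> off) & ((1 << width) - 1))
    return out
```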
  • Patent number: 9367318
    Abstract: Methods and systems are provided for managing thread execution in a processor. Multiple instructions are fetched from fetch queues. The instructions satisfy the condition that they involve fewer bits than the integer processing pathway that is used to execute them. The instructions are decoded, and divided into groups. The instructions are processed simultaneously through the pathway, such that part of the pathway is used to execute one group of instructions and another part of the pathway is used to execute another group of instructions. These parts are isolated from one another so the execution of the instructions can share the pathway and execute simultaneously and independently.
    Type: Grant
    Filed: November 3, 2015
    Date of Patent: June 14, 2016
    Assignee: Google Inc.
    Inventor: James Laudon
  • Patent number: 9369725
    Abstract: A method for implementing a deblocking filter including the steps of (A) reading pixel values for a plurality of macroblocks of an unfiltered video frame from an input buffer into a working buffer, where the working buffer has dimensions determined by a predefined input region of the deblocking filter and a portion of the working buffer forms a filter output region of the deblocking filter, (B) sequentially processing the pixel values in the working buffer through a plurality of filter processing stages using an array of software-configurable general purpose parallel processors, where each of the plurality of filter processing stages operates on a respective set of the pixel values in the working buffer, and (C) writing filtered pixel values from the filter output region of the working buffer to an output buffer after the plurality of filter processing stages are completed.
    Type: Grant
    Filed: October 22, 2012
    Date of Patent: June 14, 2016
    Assignee: Amazon Technologies, Inc.
    Inventor: Brian G. Lewis
  • Patent number: 9342334
    Abstract: A system and method for simulating new instructions without compiler support for the new instructions. A simulator detects a given region in code generated by a compiler. The given region may be a candidate for vectorization or may be a region already vectorized. In response to the detection, the simulator suspends execution of a time-based simulation. The simulator then serially executes the region for at least two iterations using a functional-based simulation and using instructions with operands which correspond to P or fewer lanes of single-instruction-multiple-data (SIMD) execution. The value P is a maximum number of lanes of SIMD execution supported by the compiler. The simulator stores checkpoint state during the serial execution. In response to determining no inter-iteration memory dependencies exist, the simulator returns to the time-based simulation and resumes execution using N-wide vector instructions.
    Type: Grant
    Filed: June 22, 2012
    Date of Patent: May 17, 2016
    Assignee: Advanced Micro Devices, Inc.
    Inventors: Bradford M. Beckmann, Nilay Vaish, Steven K. Reinhardt
  • Patent number: 9329870
    Abstract: A method and circuit arrangement tightly couple together decode logic associated with multiple types of execution units and having varying priorities to enable instructions that are decoded as valid instructions for multiple types of execution units to be forwarded to a highest priority type of execution unit among the multiple types of execution units. Among other benefits, when an auxiliary execution unit is coupled to a general purpose processing core with the decode logic for the auxiliary execution unit tightly coupled with the decode logic for the general purpose processing core, the auxiliary execution unit may be used to effectively overlay new functionality for an existing instruction that is normally executed by the general purpose processing core, e.g., to patch a design flaw in the general purpose processing core or to provide improved performance for specialized applications.
    Type: Grant
    Filed: February 13, 2013
    Date of Patent: May 3, 2016
    Assignee: International Business Machines Corporation
    Inventors: Adam J. Muff, Paul E. Schardt, Robert A. Shearer, Matthew R. Tubbs
  • Patent number: 9329671
    Abstract: Computer system, method and computer program product for scheduling IPC activities are disclosed. In one embodiment, the computer system includes first processor and second processors that communicate with each other via IPC activities. The second processor may operate in a first mode in which the second processor is able to process IPC activities, or a second mode in which the second processor does not process IPC activities. Processing apparatus associated with the first processor identifies which of the pending IPC activities for communicating from the first processor to the second processor are not real-time sensitive, and schedules the identified IPC activities for communicating from the first processor to the second processor by delaying some of the identified IPC activities to thereby group them together. The grouped IPC activities are scheduled for communicating to the second processor during a period in which the second processor is continuously in the first mode.
    Type: Grant
    Filed: January 29, 2013
    Date of Patent: May 3, 2016
    Assignee: Nvidia Corporation
    Inventors: Greg Heinrich, Philippe Guasch
  • Patent number: 9317296
    Abstract: Methods, media, and computer systems are provided. The method includes, the media includes control logic for, and the computer system includes a processor with control logic for overriding an execution mask of SIMD hardware to enable at least one of a plurality of lanes of the SIMD hardware. Overriding the execution mask is responsive to a data parallel computation and a diverged control flow of a workgroup.
    Type: Grant
    Filed: December 21, 2012
    Date of Patent: April 19, 2016
    Assignee: ADVANCED MICRO DEVICES, INC.
    Inventors: Timothy G. Rogers, Bradford M. Beckmann, James M. O'Connor
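A hedged Python model of what overriding an execution mask means: normally only lanes enabled in the mask compute, and the override forces a different set of lanes on (for example, re-enabling a lane during diverged control flow). The function and parameter names are assumptions for illustration.

```python
def simd_apply(values, fn, exec_mask, override=None):
    """Apply `fn` per lane; only lanes enabled in the (possibly overridden)
    execution mask compute, disabled lanes keep their old values."""
    mask = override if override is not None else exec_mask
    return [fn(v) if enabled else v for v, enabled in zip(values, mask)]
```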
  • Patent number: 9311102
    Abstract: Systems and methods to improve performance in a graphics processing unit are described herein. Embodiments achieve power saving in a graphics processing unit by dynamically activating/deactivating individual SIMDs in a shader complex that comprises multiple SIMD units. On-the-fly dynamic disabling and enabling of individual SIMDs provides flexibility in achieving a required performance and power level for a given processing application. Embodiments of the invention also achieve dynamic medium grain clock gating of SIMDs in a shader complex. Embodiments reduce switching power by shutting down clock trees to unused logic by providing a clock on demand mechanism. In this way, embodiments enhance clock gating to save more switching power for the duration of time when SIMDs are idle (or assigned no work). Embodiments can also save leakage power by power gating SIMDs for a duration when SIMDs are idle for an extended period of time.
    Type: Grant
    Filed: July 12, 2011
    Date of Patent: April 12, 2016
    Assignee: Advanced Micro Devices, Inc.
    Inventors: Tushar K. Shah, Michael J. Mantor, Brian Emberling
  • Patent number: 9275984
    Abstract: A multi-chip package system includes a signal transmission line commonly coupled to a plurality of semiconductor chips to transfer data to/from the semiconductor chips from/to outside; and a termination controller suitable for detecting a loading value of the signal transmission line and controlling a termination operation on the signal transmission line based on the loading value.
    Type: Grant
    Filed: July 5, 2013
    Date of Patent: March 1, 2016
    Assignee: SK Hynix Inc.
    Inventor: Chun-Seok Jeong
  • Patent number: 9262704
    Abstract: Methods and systems render higher bit per pixel contone images to lower bit formats using multiple registers of a SIMD processor. The rendering process uses a first register to maintain contone image values of all the pixels being simultaneously processed. A second register maintains a threshold value used during the conversion process. A third register maintains one value for the print ready format pixels (e.g., those having less bits per pixel), and a fourth register maintains the other value (e.g., 0) for the print ready format pixels. Also, a fifth register maintains the conversion error amount for all the pixels being simultaneously processed. Sixth through ninth registers maintain distributed conversion error amounts produced by the diffusing process (for different pixels being simultaneously processed); and a tenth register maintains the pixels in the print-ready format produced by the conversion for all the pixels being simultaneously processed.
    Type: Grant
    Filed: March 4, 2015
    Date of Patent: February 16, 2016
    Assignee: Xerox Corporation
    Inventors: David Jon Metcalfe, Ryan David Metcalfe
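A hedged sketch of the register roles in the abstract: one register of contone values, a threshold register, the two output constants, and a register of per-lane conversion error. The 8-bit (0..255) contone scale and the error formula are assumptions; the patent's exact diffusion arithmetic is not given in the abstract.

```python
def threshold_lanes(contone, threshold, on_value=1, off_value=0):
    """Convert a 'register' of 8-bit contone pixels to print-ready values
    and compute the per-lane conversion error to be diffused."""
    out = [on_value if c >= threshold else off_value for c in contone]
    # Error is the difference between the contone value and the level
    # the chosen output value represents (255 for 'on', 0 for 'off').
    err = [c - (255 if o == on_value else 0) for c, o in zip(contone, out)]
    return out, err
```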
  • Patent number: 9229721
    Abstract: This disclosure is directed to techniques for executing subroutines in a single instruction, multiple data (SIMD) processing system that is subject to divergent thread conditions. In particular, a resume counter-based approach for managing divergent thread state is described that utilizes program module-specific minimum resume counters (MINRCs) for the efficient processing of control flow instructions. In some examples, the techniques of this disclosure may include using a main program MINRC to control the execution of a main program module and subroutine-specific MINRCs to control the execution of subroutine program modules. Techniques are also described for managing the main program MINRC and subroutine-specific MINRCs when subroutine call and return instructions are executed. Techniques are also described for updating a subroutine-specific MINRC to ensure that the updated MINRC value for the subroutine-specific MINRC is within the program space allocated for the subroutine.
    Type: Grant
    Filed: September 10, 2012
    Date of Patent: January 5, 2016
    Assignee: QUALCOMM Incorporated
    Inventor: Lin Chen
  • Patent number: 9195675
    Abstract: Embodiments provide methods and systems for encoding and decoding variable-length data, which may include methods for encoding and decoding search engine posting lists. Embodiments may include different encoding formats including group unary, packed unary, and/or packed binary formats. Some embodiments may utilize single instruction multiple data (SIMD) instructions that may perform a parallel shuffle operation on encoded data as part of the decoding processes. Some embodiments may utilize lookup tables to determine shuffle sequences and/or masks and/or shifts to be utilized in the decoding processes. Some embodiments may utilize hybrid formats.
    Type: Grant
    Filed: March 31, 2011
    Date of Patent: November 24, 2015
    Assignee: A9.com, Inc.
    Inventors: Daniel E. Rose, Alexander A. Stepanov, Anil Ramesh Gangolli, Paramjit S. Oberoi, Ryan Jacob Ernst
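A heavily hedged, simplified decoder in the spirit of the abstract's packed-binary format: a one-byte descriptor holds four 2-bit length codes that a lookup (here computed inline) turns into per-integer byte counts. The descriptor layout is an assumption for illustration only; the patent's actual formats and SIMD shuffle tables are not reproduced here.

```python
def decode_group(descriptor, payload):
    """Decode four variable-length little-endian integers whose byte
    lengths (1..4) are encoded as 2-bit codes in `descriptor`."""
    out, pos = [], 0
    for slot in range(4):
        nbytes = ((descriptor >> (2 * slot)) & 0b11) + 1
        out.append(int.from_bytes(payload[pos:pos + nbytes], "little"))
        pos += nbytes
    return out
```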
  • Patent number: 9189828
    Abstract: An accelerator system is implemented on an expansion card comprising a printed circuit board having (a) one or more graphics processing units (GPUs), (b) two or more associated memory banks (logically or physically partitioned), (c) a specialized controller, and (d) a local bus providing signal coupling compatible with the PCI industry standards. The controller handles most of the primitive operations to set up and control GPU computation. Thus, the computer's central processing unit (CPU) can be dedicated to other tasks. In this case a few controls (simulation start and stop signals from the CPU and the simulation completion signal back to CPU), GPU programs and input/output data are exchanged between CPU and the expansion card. Moreover, since on every time step of the simulation the results from the previous time step are used but not changed, the results are preferably transferred back to CPU in parallel with the computation.
    Type: Grant
    Filed: January 3, 2014
    Date of Patent: November 17, 2015
    Assignee: Neurala, Inc.
    Inventors: Anatoli Gorchetchnikov, Heather Marie Ames, Massimiliano Versace, Fabrizio Santini
  • Patent number: 9183907
    Abstract: One or more techniques for improving Vccmin for a dual port synchronous random access memory (DPSRAM) cell utilized as a single port synchronous random access memory (SPSRAM) cell are provided herein. In some embodiments, a second word line signal is sent to a second word line of the DPSRAM cell. For example, the second word line signal is sent in response to a logical low at a first bit line or a logical low at a second bit line. In this way, Vccmin is improved for the DPSRAM cell.
    Type: Grant
    Filed: November 28, 2012
    Date of Patent: November 10, 2015
    Assignee: Taiwan Semiconductor Manufacturing Company Limited
    Inventors: Ching-Wei Wu, Cheng Hung Lee, Chia-Cheng Chen
  • Patent number: 9164770
    Abstract: There is provided a method of performing single instruction multiple data (SIMD) operations. The method comprises storing a plurality of arrays in memory for performing SIMD operations thereon; determining a total number of SIMD operations to be performed on the plurality of arrays; loading a counter with the total number of SIMD operations to be performed on the plurality of arrays; enabling a plurality of arithmetic logic units (ALUs) to perform a first number of operations on first elements of the plurality of arrays; performing the first number of operations on first elements of the plurality of arrays using the plurality of ALUs; decrementing the counter by the first number of operations to provide a remaining number of operations; and enabling a number of the plurality of ALUs to perform the remaining number of operations on second elements of the plurality of arrays.
    Type: Grant
    Filed: November 23, 2009
    Date of Patent: October 20, 2015
    Assignee: Mindspeed Technologies, Inc.
    Inventor: Patrick D. Ryan
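The counter-driven loop in the abstract can be sketched directly: a counter is loaded with the total number of element operations, each pass enables up to the available number of ALUs, and the counter is decremented by the operations actually performed until none remain. Names here are assumptions.

```python
def run_simd_ops(arrays, op, num_alus):
    """Apply `op` element-wise across `arrays`, enabling at most
    `num_alus` ALUs per pass and counting down the remaining work."""
    counter = len(arrays[0])          # total SIMD operations to perform
    results, i = [], 0
    while counter > 0:
        enabled = min(num_alus, counter)   # enable only as many ALUs as needed
        for lane in range(enabled):
            results.append(op(*(a[i + lane] for a in arrays)))
        counter -= enabled
        i += enabled
    return results
```

With 5 elements and 4 ALUs, the first pass performs 4 operations and the second enables a single ALU for the remaining one, as the abstract describes.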
  • Patent number: 9159276
    Abstract: According to one embodiment of the present invention, a method for creating bit planes from frame data for a digital mirror device is disclosed including forming data elements comprising bits of equal significance from a plurality of pixel data in the frame data, the forming including using dual index direct memory address operations.
    Type: Grant
    Filed: December 20, 2007
    Date of Patent: October 13, 2015
    Assignee: TEXAS INSTRUMENTS INCORPORATED
    Inventors: James N. Malina, Leonardo W. Estevez, Gunter Schmer
  • Patent number: 9147470
    Abstract: Apparatus for programming a non-volatile memory, the apparatus having corresponding methods and tangible computer-readable media, comprise: a command memory configured to hold a plurality of command templates, wherein each of the command templates specifies a sequence of pad signals; a state machine configured to i) receive descriptors, wherein each of the descriptors includes a pointer to a respective one of the command templates in the command memory, and ii) generate the sequence of pad signals based on the command template indicated by the respective pointer; and a non-volatile memory interface configured to provide, to pads of the non-volatile memory, the sequence of pad signals generated by the state machine.
    Type: Grant
    Filed: November 12, 2012
    Date of Patent: September 29, 2015
    Assignee: MARVELL INTERNATIONAL LTD.
    Inventors: Chih-Ching Chen, Hyunsuk Shin, Chi Kong Lee, Xueting Yu
  • Patent number: 9092227
    Abstract: A vector slot processor that is capable of supporting multiple signal processing operations for multiple demodulation standards is provided. The vector slot processor includes a plurality of micro execution slots (MES) that perform the multiple signal processing operations on the high speed streaming inputs. Each of the MES includes one or more n-way signal registers that receive the high speed streaming inputs, one or more n-way coefficient registers that store filter coefficients for the multiple signal processing, and one or more n-way Multiply and Accumulate (MAC) units that receive the high speed streaming inputs from the one or more n-way signal registers and filter coefficients from one or more n-way coefficient registers. The one or more n-way MAC units perform a vertical MAC operation and a horizontal multiply and add operation on the high speed streaming inputs.
    Type: Grant
    Filed: May 2, 2012
    Date of Patent: July 28, 2015
    Inventors: Anindya Saha, Gururaj Padaki, Santosh Billava, Rakesh A. Joshi
  • Patent number: 9086872
    Abstract: Receiving an instruction indicating first and second operands. Each of the operands having packed data elements that correspond in respective positions. A first subset of the data elements of the first operand and a first subset of the data elements of the second operand each corresponding to a first lane. A second subset of the data elements of the first operand and a second subset of the data elements of the second operand each corresponding to a second lane. Storing result, in response to instruction, including: (1) in first lane, only lowest order data elements from first subset of first operand interleaved with corresponding lowest order data elements from first subset of second operand; and (2) in second lane, only highest order data elements from second subset of first operand interleaved with corresponding highest order data elements from second subset of second operand.
    Type: Grant
    Filed: June 30, 2009
    Date of Patent: July 21, 2015
    Assignee: Intel Corporation
    Inventors: Asaf Hargil, Doron Orenstein
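A hedged model of the result described in this abstract (and the related patent 9081562 below), assuming two lanes of four elements each: lane 0 of the result interleaves only the lowest-order elements of the lane-0 subsets, and lane 1 interleaves only the highest-order elements of the lane-1 subsets. The lane width of 4 is an assumed example.

```python
def interleave_low_high(op1, op2, lane_width=4):
    """Two-lane interleave: lane 0 takes the lowest-order halves of each
    operand's lane-0 subset; lane 1 takes the highest-order halves of
    each operand's lane-1 subset."""
    a_lo, a_hi = op1[:lane_width], op1[lane_width:]
    b_lo, b_hi = op2[:lane_width], op2[lane_width:]
    half = lane_width // 2
    lane0 = [x for pair in zip(a_lo[:half], b_lo[:half]) for x in pair]
    lane1 = [x for pair in zip(a_hi[half:], b_hi[half:]) for x in pair]
    return lane0 + lane1
```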
  • Patent number: 9081562
    Abstract: Receiving an instruction indicating first and second operands. Each of the operands having packed data elements that correspond in respective positions. A first subset of the data elements of the first operand and a first subset of the data elements of the second operand each corresponding to a first lane. A second subset of the data elements of the first operand and a second subset of the data elements of the second operand each corresponding to a second lane. Storing result, in response to instruction, including: (1) in first lane, only lowest order data elements from first subset of first operand interleaved with corresponding lowest order data elements from first subset of second operand; and (2) in second lane, only highest order data elements from second subset of first operand interleaved with corresponding highest order data elements from second subset of second operand.
    Type: Grant
    Filed: March 15, 2013
    Date of Patent: July 14, 2015
    Assignee: Intel Corporation
    Inventors: Asaf Hargil, Doron Orenstein
  • Publication number: 20150127924
    Abstract: A method and corresponding apparatus for processing a shuffle instruction are provided. Shuffle units are configured in a hierarchical structure, and each of the shuffle units generates a shuffled data element array by performing shuffling on an input data element array. In the hierarchical structure, which includes an upper shuffle unit and a lower shuffle unit, the shuffled data element array output from the lower shuffle unit is input to the upper shuffle unit as a portion of the input data element array for the upper shuffle unit.
    Type: Application
    Filed: July 14, 2014
    Publication date: May 7, 2015
    Applicant: SAMSUNG ELECTRONICS CO., LTD.
    Inventors: Keshava PRASAD, Navneet BASUTKAR, Young Hwan PARK, Ho YANG, Yeon Bok LEE
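A minimal sketch of the two-level hierarchy in the abstract (names and the index-vector shuffle representation are assumptions): the lower unit's shuffled output is fed to the upper unit as part of its input data element array.

```python
def shuffle(data, pattern):
    """Basic shuffle unit: output element i is data[pattern[i]]."""
    return [data[p] for p in pattern]

def hierarchical_shuffle(data, lower_pattern, upper_pattern):
    """Lower unit shuffles a prefix of the data; its output becomes a
    portion of the upper unit's input array."""
    lower_out = shuffle(data[:len(lower_pattern)], lower_pattern)
    upper_in = lower_out + data[len(lower_pattern):]
    return shuffle(upper_in, upper_pattern)
```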
  • Publication number: 20150100758
    Abstract: A data processor includes a register file divided into at least a first portion and a second portion for storing data. A single instruction, multiple data (SIMD) unit is also divided into at least a first lane and a second lane. The first and second lanes of the SIMD unit correspond respectively to the first and second portions of the register file. Furthermore, each lane of the SIMD unit is capable of data processing. The data processor also includes a realignment element in communication with the register file and the SIMD unit. The realignment element is configured to selectively realign conveyance of data between the first portion of the register file and the first lane of the SIMD unit to the second lane of the SIMD unit.
    Type: Application
    Filed: October 3, 2013
    Publication date: April 9, 2015
    Applicant: ADVANCED MICRO DEVICES, INC.
    Inventors: Timothy G. Rogers, Bradford M. Beckmann, James M. O'Connor
  • Patent number: 8996845
    Abstract: A vector compare-and-exchange operation is performed by: decoding by a decoder in a processing device, a single instruction specifying a vector compare-and-exchange operation for a plurality of data elements between a first storage location, a second storage location, and a third storage location; issuing the single instruction for execution by an execution unit in the processing device; and responsive to the execution of the single instruction, comparing data elements from the first storage location to corresponding data elements in the second storage location; and responsive to determining a match exists, replacing the data elements from the first storage location with corresponding data elements from the third storage location.
    Type: Grant
    Filed: December 22, 2009
    Date of Patent: March 31, 2015
    Assignee: Intel Corporation
    Inventors: Ravi Rajwar, Andrew T. Forsyth
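A hedged per-element model of the vector compare-and-exchange: each element of the first storage location is compared with the corresponding element of the second, and where they match it is replaced by the element from the third. Whether the real instruction replaces per element or all-or-nothing is not stated in the abstract; this sketch assumes per element.

```python
def vector_cmpxchg(dest, expected, replacement):
    """Element-wise compare-and-exchange across three 'storage locations'."""
    return [r if d == e else d
            for d, e, r in zip(dest, expected, replacement)]
```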
  • Patent number: 8959319
    Abstract: Embodiments of the present invention provide systems, methods, and computer program products for improving divergent conditional branches in code being executed by a processor. For example, in an embodiment, a method comprises detecting a conditional statement of a program being simultaneously executed by a plurality of threads, determining which threads evaluate a condition of the conditional statement as true and which threads evaluate the condition as false, pushing an identifier associated with the larger set of the threads onto a stack, executing code associated with a smaller set of the threads, and executing code associated with the larger set of the threads.
    Type: Grant
    Filed: December 2, 2011
    Date of Patent: February 17, 2015
    Assignee: Advanced Micro Devices, Inc.
    Inventors: Mark Leather, Norman Rubin, Brian D. Emberling, Michael Mantor
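    The scheduling idea in this abstract (defer the larger thread subset on a stack, run the smaller subset first) can be modeled in a few lines of Python. A hedged sketch, not the patented hardware; thread sets are plain lists and bodies are callables.

```python
def execute_divergent_branch(threads, condition, taken_body, not_taken_body):
    """Run a divergent if/else over SIMD threads: the smaller subset
    executes first while an identifier for the larger subset is pushed
    onto a reconvergence stack, then the larger subset executes."""
    true_set = [t for t in threads if condition(t)]
    false_set = [t for t in threads if not condition(t)]
    pairs = sorted([(true_set, taken_body), (false_set, not_taken_body)],
                   key=lambda pair: len(pair[0]))
    smaller, larger = pairs
    stack = [larger]                 # defer the larger set
    for t in smaller[0]:             # execute code for the smaller set
        smaller[1](t)
    deferred_threads, deferred_body = stack.pop()
    for t in deferred_threads:       # then execute code for the larger set
        deferred_body(t)

order = []
execute_divergent_branch(
    [0, 1, 2, 3, 4],
    lambda t: t < 1,                         # only thread 0 takes the branch
    lambda t: order.append(('T', t)),
    lambda t: order.append(('F', t)))
# order starts with ('T', 0): the one-thread set ran first
```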
  • Patent number: 8954943
    Abstract: A method for analyzing data reordering operations in Single Instruction, Multiple Data source code and generating executable code therefrom is provided. Input is received. One or more data reordering operations in the input are identified and each data reordering operation in the input is abstracted into a corresponding virtual shuffle operation so that each virtual shuffle operation forms part of an expression tree. One or more virtual shuffle trees are collapsed by combining virtual shuffle operations within at least one of the one or more virtual shuffle trees to form one or more combined virtual shuffle operations, wherein each virtual shuffle tree is a subtree of the expression tree that only contains virtual shuffle operations. Then code is generated for the one or more combined virtual shuffle operations.
    Type: Grant
    Filed: January 26, 2006
    Date of Patent: February 10, 2015
    Assignee: International Business Machines Corporation
    Inventors: Alexandre E. Eichenberger, Kai-Ting Amy Wang, Peng Wu, Peng Zhao
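    The core of collapsing a shuffle tree is that two nested shuffles compose into one equivalent shuffle. A minimal sketch of that composition, with shuffles represented as lists of source indices (an assumption for illustration, not the patent's internal representation):

```python
def compose_shuffles(outer, inner):
    """Collapse two nested shuffles into a single equivalent shuffle:
    applying `inner` then `outer` equals applying the composition once."""
    return [inner[i] for i in outer]

def apply_shuffle(pattern, data):
    """Gather data elements according to a list of source indices."""
    return [data[i] for i in pattern]

data = [10, 20, 30, 40]
inner = [3, 2, 1, 0]    # reverse
outer = [1, 0, 3, 2]    # swap adjacent pairs
combined = compose_shuffles(outer, inner)
# One combined shuffle reproduces the two-step result.
assert apply_shuffle(combined, data) == \
       apply_shuffle(outer, apply_shuffle(inner, data))
```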
  • Publication number: 20150019838
    Abstract: A method of loading and duplicating scalar data from a source into a destination register. The data may be duplicated in byte, half word, word or double word parts, according to a duplication pattern.
    Type: Application
    Filed: July 9, 2014
    Publication date: January 15, 2015
    Inventors: Timothy David Anderson, Duc Quang Bui, Peter Richard Dent
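    The load-and-duplicate pattern above is straightforward to model: a scalar part of a given width is replicated across the destination register. An illustrative sketch, assuming byte strings stand in for register contents.

```python
def load_dup(scalar_bytes, part_size, reg_size):
    """Duplicate a scalar part of `part_size` bytes (byte, half word,
    word, or double word) across a `reg_size`-byte destination."""
    assert reg_size % part_size == 0
    part = scalar_bytes[:part_size]
    return part * (reg_size // part_size)

# Duplicating a 2-byte half word across an 8-byte register:
assert load_dup(b'\x12\x34', 2, 8) == b'\x12\x34' * 4
```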
  • Publication number: 20150012724
    Abstract: A data processing apparatus has permutation circuitry for performing a permutation operation for changing a data element size or data element positioning of at least one source operand to generate first and second SIMD operands, and SIMD processing circuitry for performing a SIMD operation on the first and second SIMD operands. In response to a first SIMD instruction requiring a permutation operation, the instruction decoder controls the permutation circuitry to perform the permutation operation to generate the first and second SIMD operands and then controls the SIMD processing circuitry to perform the SIMD operation using these operands. In response to a second SIMD instruction not requiring a permutation operation, the instruction decoder controls the SIMD processing circuitry to perform the SIMD operation using the first and second SIMD operands identified by the instruction, without passing them via the permutation circuitry.
    Type: Application
    Filed: July 8, 2013
    Publication date: January 8, 2015
    Inventors: David Raymond Lutz, Neil Burgess
  • Patent number: 8918553
    Abstract: A mechanism for programming a direct memory access engine operating as a multithreaded processor is provided. A plurality of programs is received from a host processor in a local memory associated with the direct memory access engine. A request is received in the direct memory access engine from the host processor indicating that the plurality of programs located in the local memory is to be executed. The direct memory access engine executes two or more of the plurality of programs without intervention by a host processor. As each of the two or more of the plurality of programs completes execution, the direct memory access engine sends a completion notification to the host processor that indicates that the program has completed execution.
    Type: Grant
    Filed: June 5, 2012
    Date of Patent: December 23, 2014
    Assignee: International Business Machines Corporation
    Inventors: Brian K. Flachs, Harm P. Hofstee, Charles R. Johns, Matthew E. King, John S. Liberty, Brad W. Michael
  • Patent number: 8914613
    Abstract: In-lane vector shuffle operations are described. In one embodiment a shuffle instruction specifies a field of per-lane control bits, a source operand and a destination operand, these operands having corresponding lanes, each lane divided into corresponding portions of multiple data elements. Sets of data elements are selected from corresponding portions of every lane of the source operand according to per-lane control bits. Elements of these sets are copied to specified fields in corresponding portions of every lane of the destination operand. Another embodiment of the shuffle instruction also specifies a second source operand, all operands having corresponding lanes divided into multiple data elements. A set selected according to per-lane control bits contains data elements from every lane portion of a first source operand and data elements from every corresponding lane portion of the second source operand. Set elements are copied to specified fields in every lane of the destination operand.
    Type: Grant
    Filed: August 26, 2011
    Date of Patent: December 16, 2014
    Assignee: Intel Corporation
    Inventors: Zeev Sperber, Robert Valentine, Benny Eitan, Doron Orenstein
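    The first embodiment above (one control field applied identically to every lane) can be sketched as follows. This is a behavioral model in the spirit of AVX-style in-lane shuffles, not the patented datapath; the lane width is a parameter for illustration.

```python
def in_lane_shuffle(src, control, lane_elems=4):
    """Per-lane shuffle: within each lane of `lane_elems` elements,
    result element j is selected from the same lane of `src` according
    to control[j]; the same control bits apply to every lane."""
    out = []
    for lane_start in range(0, len(src), lane_elems):
        lane = src[lane_start:lane_start + lane_elems]
        out.extend(lane[c] for c in control)
    return out

# Two 4-element lanes; the control reverses each lane independently,
# and no element ever crosses a lane boundary.
assert in_lane_shuffle([0, 1, 2, 3, 10, 11, 12, 13], [3, 2, 1, 0]) == \
       [3, 2, 1, 0, 13, 12, 11, 10]
```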
  • Patent number: 8898432
    Abstract: Systems and methods for folding a single instruction multiple data (SIMD) array include a newly defined processing element group (PEG) that allows interconnection of PEGs by abutment without requiring a row or column weave pattern. The interconnected PEGs form a SIMD array that is effectively folded at its center along the North-South axis, and may also be folded along the East-West axis. The folding of the array provides for north and south boundaries to be co-located and for east and west boundaries to be co-located. The co-location allows wrap-around connections to be done with a propagation distance reduced effectively to zero.
    Type: Grant
    Filed: October 25, 2011
    Date of Patent: November 25, 2014
    Assignee: Geo Semiconductor, Inc.
    Inventor: Woodrow L. Meeker
  • Patent number: 8892781
    Abstract: A computer program product, apparatus, and a method for facilitating input/output (I/O) processing for an I/O operation at a host computer system configured for communication with a control unit. The method includes receiving a command block from the channel subsystem, the command block including at least one input command and at least one output command specified by a transport command word (TCW) and associated with the I/O operation, the I/O operation having both input and output data, the TCW specifying a location in the memory of the output data and a location in the memory for storing the input data; receiving the output data specified by the TCW and executing the at least one output command; and forwarding the input data specified by the TCW to the channel subsystem for storage at a location specified by the TCW.
    Type: Grant
    Filed: June 13, 2013
    Date of Patent: November 18, 2014
    Assignee: International Business Machines Corporation
    Inventors: John R. Flanagan, Daniel F. Casper, Catherine C. Huang, Matthew J. Kalos, Ugochukwu C. Njoku, Dale F. Riedy, Gustav E. Sittmann, III
  • Patent number: 8862827
    Abstract: A cache manager receives a request for data, which includes a requested effective address. The cache manager determines whether the requested effective address matches a most recently used effective address stored in a mapped tag vector. When the most recently used effective address matches the requested effective address, the cache manager identifies a corresponding cache location and retrieves the data from the identified cache location. However, when the most recently used effective address fails to match the requested effective address, the cache manager determines whether the requested effective address matches a subsequent effective address stored in the mapped tag vector. When the cache manager determines a match to a subsequent effective address, the cache manager identifies a different cache location corresponding to the subsequent effective address and retrieves the data from the different cache location.
    Type: Grant
    Filed: December 29, 2009
    Date of Patent: October 14, 2014
    Assignee: International Business Machines Corporation
    Inventors: Brian Flachs, Barry L. Minor, Mark Richard Nutter
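    The lookup order described above (most recently used effective address first, then the remaining tag-vector entries) can be sketched simply. An illustrative model under the assumption that the tag vector is an MRU-ordered list of (effective address, cache location) pairs.

```python
def mapped_tag_lookup(tag_vector, cache, requested_ea):
    """Check the most-recently-used effective address first (entry 0),
    then each subsequent entry of the mapped tag vector; return the
    data at the matching cache location, or None on a miss."""
    for ea, location in tag_vector:
        if ea == requested_ea:
            return cache[location]
    return None

tags = [(0x100, 0), (0x200, 1)]          # entry 0 is the MRU address
cache = ['mru-data', 'older-data']
assert mapped_tag_lookup(tags, cache, 0x200) == 'older-data'
```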
  • Patent number: 8856494
    Abstract: A data processing circuit contains an instruction execution circuit having an instruction set comprising a SIMD instruction. The instruction execution circuit comprises arithmetic circuits, arranged to perform N respective identical operations in parallel in response to the SIMD instruction. The SIMD instruction selects a first one and a second one of the registers. The SIMD instruction defines a first and second series of N respective SIMD instruction operands of the SIMD instruction from the addressed registers. Each arithmetic circuit receives a respective first operand and a respective second operand from the first and second series respectively. The instruction execution circuit selects the first and second series so they partially overlap. Positioning of the operands is under program control.
    Type: Grant
    Filed: January 11, 2012
    Date of Patent: October 7, 2014
    Assignee: Intel Corporation
    Inventor: Antonius A. M. Van Wel
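    The overlapping-series idea above can be modeled by drawing both operand series from one flat register array at program-controlled offsets. A sketch only; the flat-array view and addition as the operation are assumptions for illustration.

```python
def simd_add_overlapping(regs, first_off, second_off, n):
    """Perform N identical additions whose two operand series start at
    program-controlled offsets into a register array; the series may
    partially overlap, as in the abstract above."""
    a = regs[first_off:first_off + n]
    b = regs[second_off:second_off + n]
    return [x + y for x, y in zip(a, b)]

# Series starting at offsets 0 and 1 overlap in three of four elements.
regs = [1, 2, 3, 4, 5]
assert simd_add_overlapping(regs, 0, 1, 4) == [3, 5, 7, 9]
```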
  • Patent number: 8838946
    Abstract: An apparatus includes an instruction decoder, first and second source registers and a circuit coupled to the decoder to receive packed data from the source registers and to unpack the packed data responsive to an unpack instruction received by the decoder. A first packed data element and a third packed data element are received from the first source register. A second packed data element and a fourth packed data element are received from the second source register. The circuit copies the packed data elements into a destination register resulting with the second packed data element adjacent to the first packed data element, the third packed data element adjacent to the second packed data element, and the fourth packed data element adjacent to the third packed data element.
    Type: Grant
    Filed: December 29, 2012
    Date of Patent: September 16, 2014
    Assignee: Intel Corporation
    Inventors: Alexander Peleg, Yaakov Yaari, Millind Mittal, Larry M. Mennemeier, Benny Eitan
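    The adjacency the abstract describes (first/third elements from one source, second/fourth from the other) is an interleave. A minimal behavioral sketch, not the patented circuit:

```python
def unpack_interleave(src1, src2):
    """Interleave packed elements from two source registers into a
    destination: [a0, b0, a1, b1, ...], so each src2 element lands
    adjacent to its corresponding src1 element."""
    dest = []
    for a, b in zip(src1, src2):
        dest.extend([a, b])
    return dest

assert unpack_interleave([1, 3], [2, 4]) == [1, 2, 3, 4]
```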
  • Patent number: 8832417
    Abstract: This disclosure describes techniques for handling divergent thread conditions in a multi-threaded processing system. In some examples, a control flow unit may obtain a control flow instruction identified by a program counter value stored in a program counter register. The control flow instruction may include a target value indicative of a target program counter value for the control flow instruction. The control flow unit may select one of the target program counter value and a minimum resume counter value as a value to load into the program counter register. The minimum resume counter value may be indicative of a smallest resume counter value from a set of one or more resume counter values associated with one or more inactive threads. Each of the one or more resume counter values may be indicative of a program counter value at which a respective inactive thread should be activated.
    Type: Grant
    Filed: September 7, 2011
    Date of Patent: September 9, 2014
    Assignee: QUALCOMM Incorporated
    Inventors: Lin Chen, David Rigel Garcia Garcia, Andrew E. Gruber, Guofang Jiao
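    The selection rule above (jump to the branch target unless some inactive thread's resume counter comes earlier) can be sketched as follows. This is a deliberately simplified model of one plausible policy; the actual selection logic in the patent is more involved.

```python
def next_program_counter(target_pc, resume_counters):
    """Select the next PC for a divergent control-flow instruction:
    prefer the smallest resume counter among inactive threads when it
    precedes the branch target, so those threads get reactivated."""
    if resume_counters:
        min_resume = min(resume_counters)
        if min_resume < target_pc:
            return min_resume
    return target_pc

assert next_program_counter(100, [40, 60]) == 40   # reactivate earliest thread
assert next_program_counter(100, []) == 100        # no inactive threads
```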
  • Patent number: 8825987
    Abstract: Methods, apparatus, and instructions for performing string comparison operations. An apparatus may include execution resources to execute a first instruction. In response to the first instruction, said execution resources store a result of a comparison between each data element of a first and second operand corresponding to a first and second text string, respectively.
    Type: Grant
    Filed: December 20, 2012
    Date of Patent: September 2, 2014
    Assignee: Intel Corporation
    Inventors: Michael A. Julier, Jeffrey D. Gray, Srinivas Chennupaty, Sean P. Mirkes, Mark P. Seconi
  • Patent number: 8819394
    Abstract: Methods, apparatus, and instructions for performing string comparison operations. In one embodiment, an apparatus includes execution resources to execute a first instruction. In response to the first instruction, said execution resources store a result of a comparison between each data element of a first and second operand corresponding to a first and second text string, respectively.
    Type: Grant
    Filed: December 20, 2012
    Date of Patent: August 26, 2014
    Assignee: Intel Corporation
    Inventors: Michael A. Julier, Jeffrey D. Gray, Srinivas Chennupaty, Sean P. Mirkes, Mark P. Seconi
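    The pair of string-comparison abstracts above both store a result of comparing corresponding data elements of two text-string operands. A minimal element-wise sketch (equality as the comparison is an assumption; the instructions support richer comparison modes):

```python
def compare_strings_elementwise(op1, op2):
    """Store a per-element comparison result for two text-string
    operands: True where corresponding elements match."""
    return [a == b for a, b in zip(op1, op2)]

assert compare_strings_elementwise("abcd", "abed") == [True, True, False, True]
```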
  • Publication number: 20140208068
    Abstract: Compression and decompression of numerical data utilizing single instruction, multiple data (SIMD) instructions is described. The numerical data includes integer and floating-point samples. Compression supports three encoding modes: lossless, fixed-rate, and fixed-quality. SIMD instructions for compression operations may include attenuation, derivative calculations, bit packing to form compressed packets, header generation for the packets, and packed array output operations. SIMD instructions for decompression may include packed array input operations, header recovery, decoder control, bit unpacking, integration, and amplification. Compression and decompression may be implemented in a microprocessor, digital signal processor, field-programmable gate array, application-specific integrated circuit, system-on-chip, or graphics processor, using SIMD instructions. Compression and decompression of numerical data can reduce memory, networking, and storage bottlenecks.
    Type: Application
    Filed: January 22, 2013
    Publication date: July 24, 2014
    Applicant: Samplify Systems, Inc.
    Inventor: Albert W. Wegener
  • Publication number: 20140208069
    Abstract: An execution unit configured for compression and decompression of numerical data utilizing single instruction, multiple data (SIMD) instructions is described. The numerical data includes integer and floating-point samples. Compression supports three encoding modes: lossless, fixed-rate, and fixed-quality. SIMD instructions for compression operations may include attenuation, derivative calculations, bit packing to form compressed packets, header generation for the packets, and packed array output operations. SIMD instructions for decompression may include packed array input operations, header recovery, decoder control, bit unpacking, integration, and amplification. Compression and decompression may be implemented in a microprocessor, digital signal processor, field-programmable gate array, application-specific integrated circuit, system-on-chip, or graphics processor, using SIMD instructions. Compression and decompression of numerical data can reduce memory, networking, and storage bottlenecks.
    Type: Application
    Filed: January 22, 2013
    Publication date: July 24, 2014
    Applicant: Samplify Systems, Inc.
    Inventor: Albert W. Wegener
  • Publication number: 20140201498
    Abstract: Instructions and logic provide vector scatter-op and/or gather-op functionality. In some embodiments, responsive to an instruction specifying: a gather and a second operation, a destination register, an operand register, and a memory address; execution units read values in a mask register, wherein fields in the mask register correspond to offset indices in the indices register for data elements in memory. A first mask value indicates the element has not been gathered from memory and a second value indicates that the element does not need to be, or has already been gathered. For each field having the first value, the data element is gathered from memory into the corresponding destination register location, and the corresponding value in the mask register is changed to the second value. When all mask register fields have the second value, the second operation is performed using corresponding data in the destination and operand registers to generate results.
    Type: Application
    Filed: September 26, 2011
    Publication date: July 17, 2014
    Applicant: Intel Corporation
    Inventors: Elmoustapha Ould-Ahmed-Vall, Kshitij A. Doshi, Charles R. Yount, Suleyman Sair
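    The masked gather-then-operate flow above can be sketched behaviorally. The mask encoding, dictionary-as-memory model, and function name are assumptions for illustration; the mutable mask is what makes the instruction restartable after a fault.

```python
def masked_gather_op(memory, indices, mask, dest, operand, op):
    """Gather elements whose mask field reads NOT_DONE, flip each such
    field to DONE, and once every field reads DONE apply the second
    operation to the destination and operand registers."""
    NOT_DONE, DONE = 1, 0
    for i, m in enumerate(mask):
        if m == NOT_DONE:
            dest[i] = memory[indices[i]]
            mask[i] = DONE
    if all(m == DONE for m in mask):
        return [op(d, o) for d, o in zip(dest, operand)]

memory = {10: 5, 20: 7}
dest = [0, 0]
result = masked_gather_op(memory, [10, 20], [1, 1], dest, [1, 1],
                          lambda d, o: d + o)
# result == [6, 8]; dest holds the gathered values [5, 7]
```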
  • Publication number: 20140195778
    Abstract: Instructions and logic provide vector load-op and/or store-op with stride functionality. In some embodiments, responsive to an instruction specifying: a set of loads, a second operation, a destination register, an operand register, a memory address, and a stride length; execution units read values in a mask register, wherein fields in the mask register correspond to stride-length multiples from the memory address to data elements in memory. A first mask value indicates the element has not been loaded from memory and a second value indicates that the element does not need to be, or has already been loaded. For each field having the first value, the data element is loaded from memory into the corresponding destination register location, and the corresponding value in the mask register is changed to the second value. Then the second operation is performed using corresponding data in the destination and operand registers to generate results. The instruction may be restarted after faults.
    Type: Application
    Filed: September 26, 2011
    Publication date: July 10, 2014
    Applicant: Intel Corporation
    Inventors: Elmoustapha Ould-Ahmed-Vall, Kshitij A. Doshi, Suleyman Sair, Charles R. Yount
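    What distinguishes this publication from the gather-op one above is the addressing pattern: elements sit at stride-length multiples from a base address. A minimal sketch of that pattern (dictionary-as-memory is an assumption for illustration):

```python
def strided_load(memory, base, stride, n):
    """Load n data elements located at stride-length multiples
    from a base memory address."""
    return [memory[base + k * stride] for k in range(n)]

mem = {0: 'a', 4: 'b', 8: 'c'}
assert strided_load(mem, 0, 4, 3) == ['a', 'b', 'c']
```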
  • Patent number: 8775147
    Abstract: An algorithm and architecture are disclosed for performing multi-argument associative operations. The algorithm and architecture can be used to schedule operations on multiple facilities for computations or can be used in the development of a model in a modeling environment. The algorithm and architecture resulting from the algorithm use the latency of the components that are used to process the associative operations. The algorithm minimizes the number of components necessary to produce an output of multi-argument associative operations and also can minimize the number of inputs each component receives.
    Type: Grant
    Filed: May 31, 2006
    Date of Patent: July 8, 2014
    Assignee: The MathWorks, Inc.
    Inventors: Alireza Pakyari, Brian K. Ogilvie
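    The abstract's goal (schedule a multi-argument associative operation to exploit component latency) is typically met with a balanced pairwise tree, whose depth is logarithmic rather than linear in the argument count. A generic sketch of that shape, not the patented scheduling algorithm:

```python
def tree_reduce(values, op):
    """Combine the arguments of an associative operation pairwise in a
    balanced tree; tree depth, and hence latency, grows as O(log n)
    instead of O(n) for a sequential chain."""
    values = list(values)
    while len(values) > 1:
        paired = [op(values[i], values[i + 1])
                  for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:           # carry an odd leftover forward
            paired.append(values[-1])
        values = paired
    return values[0]

# Five arguments reduce in three levels: [1,2,3,4,5] -> [3,7,5] -> [10,5] -> [15]
assert tree_reduce([1, 2, 3, 4, 5], lambda a, b: a + b) == 15
```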