Single Instruction, Multiple Data (SIMD) Patents (Class 712/22)
-
Patent number: 9582419
Abstract: A data processing device 100 comprises a plurality of storage circuits 130, 160, which store a plurality of data elements of b bits in an interleaved manner. The data processing device also comprises a consumer 110 with a number of lanes 120. The consumer is able to individually access each of the plurality of storage circuits 130, 160 in order to receive into the lanes 120 either a subset of the plurality of data elements or y bits of each of the plurality of data elements. The consumer 110 is also able to execute a common instruction on each of the plurality of lanes 120. The relationship between b and y is such that b is greater than y and is an integer multiple of y. Each of the plurality of storage circuits 130, 160 stores at most y bits of each of the data elements. Furthermore, each of the storage circuits 130, 160 stores at most y/b of the plurality of data elements. By carrying out the interleaving in this manner, the plurality of storage circuits 130, 160 comprise no more than b/y storage circuits.
Type: Grant
Filed: October 25, 2013
Date of Patent: February 28, 2017
Assignee: ARM Limited
Inventors: Ganesh Suryanarayan Dasika, Rune Holm, Stephen John Hill
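As a rough software sketch of the interleaving described above (not the claimed hardware), assume hypothetical parameters b = 32 and y = 8: each 32-bit element is sliced into 8-bit pieces, and slice i of every element lands in storage circuit i, giving b/y = 4 circuits, each holding at most y bits of any one element:

```python
def interleave_store(elements, b=32, y=8):
    # The abstract's constraint: b is greater than y and an integer multiple of y.
    assert b > y and b % y == 0
    n_circuits = b // y              # no more than b/y storage circuits
    mask = (1 << y) - 1
    # Circuit i receives bits [i*y, (i+1)*y) of every element.
    return [[(e >> (y * i)) & mask for e in elements]
            for i in range(n_circuits)]

circuits = interleave_store([0x11223344, 0xAABBCCDD])
```

A consumer lane can then read either one y-bit slice of every element (one circuit) or whole elements (all circuits), matching the two access patterns the abstract mentions.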
-
Patent number: 9575753
Abstract: Mechanisms, in a data processing system comprising a single instruction multiple data (SIMD) processor, for performing a data dependency check operation on vector element values of at least two input vector registers are provided. Two calls to a simd-check instruction are performed, one with input vector registers having a first order and one with the input vector registers having a different order. The simd-check instruction performs comparisons to determine if any data dependencies are present. Results of the two calls to the simd-check instruction are obtained and used to determine if any data dependencies are present in the at least two input vector registers. Based on the results, the SIMD processor may perform various operations.
Type: Grant
Filed: March 15, 2012
Date of Patent: February 21, 2017
Assignee: International Business Machines Corporation
Inventors: Alexandre E. Eichenberger, Bruce M. Fleischer
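A scalar model of such a check (a sketch, not the patented instruction) might flag a lane whenever an earlier lane of the other operand holds the same value, with the two calls made in opposite operand orders as the abstract describes:

```python
def simd_check(first, second):
    """Per lane of `first`, report whether any earlier lane of `second`
    holds the same value -- a simplified stand-in for a data dependency."""
    return [any(second[j] == first[i] for j in range(i))
            for i in range(len(first))]

va = [1, 2, 3, 4]
vb = [9, 1, 8, 7]
# Two calls, with the operand order swapped between them:
result_ab = simd_check(va, vb)
result_ba = simd_check(vb, va)
has_dependency = any(result_ab) or any(result_ba)
```

If both results come back clear, the loop iterations behind the two vectors can safely execute in SIMD fashion; otherwise the processor falls back to a safer schedule.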
-
Patent number: 9529571
Abstract: An apparatus and method for creation of reordered vectors from sequential input data for block based decimation, filtering, interpolation and matrix transposition using a memory circuit for a Single Instruction, Multiple Data (SIMD) Digital Signal Processor (DSP). This memory circuit includes a two-dimensional storage array, a rotate-and-distribute unit, a read controller and a write controller, to map input vectors containing sequential data elements in columns of the two-dimensional array and extract reordered target vectors from this array. The data elements and memory configuration are received from the SIMD DSP.
Type: Grant
Filed: October 5, 2011
Date of Patent: December 27, 2016
Assignee: TELEFONAKTIEBOLAGET LM ERICSSON (PUBL)
Inventors: David Van Kampen, Kees Van Berkel, Sven Goossens, Wim Kloosterhuis, Claudiu Zissulescu-Ianculescu
-
Patent number: 9507637
Abstract: Disclosed are apparatus and methods for managing thread resources. A computing device can generate threads for an executable application. The computing device can receive an allocation request to allocate thread-specific memory for an executable thread of the threads, where thread-specific memory includes a call stack for the executable thread. In response to the allocation request, the computing device can: allocate the thread-specific memory and indicate that the executable thread is ready for execution. The computing device can execute the executable thread. The computing device can receive a sleep request to suspend executable thread execution. In response to the sleep request, the computing device can determine whether the allocated thread-specific memory is to be deallocated. After determining that the allocated thread-specific memory is to be deallocated: the thread-specific memory can be deallocated and an indication that the executable thread execution is suspended can be provided.
Type: Grant
Filed: August 8, 2013
Date of Patent: November 29, 2016
Assignee: Google Inc.
Inventor: Winthrop Lyon Saville, III
-
Patent number: 9501449
Abstract: An apparatus, computer-readable medium, and computer-implemented method for parallelization of a computer program on a plurality of computing cores includes receiving a computer program comprising a plurality of commands, decomposing the plurality of commands into a plurality of node networks, each node network corresponding to a command in the plurality of commands and including one or more nodes corresponding to execution dependencies of the command, mapping the plurality of node networks to a plurality of systolic arrays, each systolic array comprising a plurality of cells and each non-data node in each node network being mapped to a cell in the plurality of cells, and mapping each cell in each systolic array to a computing core in the plurality of computing cores.
Type: Grant
Filed: September 10, 2014
Date of Patent: November 22, 2016
Assignee: Sviral, Inc.
Inventors: Solomon Harsha, Paul Master
-
Patent number: 9495160
Abstract: Method, apparatus, and program means for performing a string comparison operation. An apparatus includes execution resources to execute a first instruction. In response to the first instruction, said execution resources store a result of a comparison between each data element of a first and second operand corresponding to a first and second text string, respectively.
Type: Grant
Filed: December 5, 2014
Date of Patent: November 15, 2016
Assignee: Intel Corporation
Inventors: Michael A. Julier, Jeffrey D. Gray, Srinivas Chennupaty, Sean P. Mirkes, Mark P. Seconi
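The per-element comparison can be pictured with a small model (a sketch only; the actual instruction encodes its result as a mask or index rather than a list of booleans):

```python
def compare_strings(op1, op2):
    # One result flag per pair of corresponding characters,
    # like the element-wise comparison the abstract describes.
    return [c1 == c2 for c1, c2 in zip(op1, op2)]

mask = compare_strings("hello", "help!")
```

A single instruction producing such a mask lets software find matches, mismatches, or substring boundaries across many characters at once instead of looping per character.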
-
Patent number: 9489180
Abstract: Methods, apparatus and computer software product for source code optimization are provided. In an exemplary embodiment, a first custom computing apparatus is used to optimize the execution of source code on a second computing apparatus. In this embodiment, the first custom computing apparatus contains a memory, a storage medium and at least one processor with at least one multi-stage execution unit. The second computing apparatus contains at least one vector execution unit that allows for parallel execution of tasks on constant-strided memory locations. The first custom computing apparatus optimizes the code for parallelism, locality of operations, constant-strided memory accesses and vectorized execution on the second computing apparatus. This Abstract is provided for the sole purpose of complying with the Abstract requirement rules. This Abstract is submitted with the explicit understanding that it will not be used to interpret or to limit the scope or the meaning of the claims.
Type: Grant
Filed: November 16, 2012
Date of Patent: November 8, 2016
Assignee: Reservoir Labs, Inc.
Inventors: Muthu Baskaran, Richard A. Lethin, Benoit J. Meister, Nicolas T. Vasilache
-
Patent number: 9424034
Abstract: A processor includes N-bit registers and a decode unit to receive a multiple register memory access instruction. The multiple register memory access instruction is to indicate a memory location and a register. The processor includes a memory access unit coupled with the decode unit and with the N-bit registers. The memory access unit is to perform a multiple register memory access operation in response to the multiple register memory access instruction. The operation is to involve N-bit data in each of the N-bit registers comprising the indicated register. The operation is also to involve different corresponding N-bit portions of an M×N-bit line of memory corresponding to the indicated memory location. A total number of bits of the N-bit data in the N-bit registers to be involved in the multiple register memory access operation is to amount to at least half of the M×N-bits of the line of memory.
Type: Grant
Filed: June 28, 2013
Date of Patent: August 23, 2016
Assignee: Intel Corporation
Inventors: Glenn Hinton, Bret Toll, Ronak Singhal
-
Patent number: 9405539
Abstract: Methods, apparatus, instructions and logic provide SIMD vector sub-byte decompression functionality. Embodiments include shuffling a first and second byte into the least significant portion of a first vector element, and a third and fourth byte into the most significant portion. Processing continues shuffling a fifth and sixth byte into the least significant portion of a second vector element, and a seventh and eighth byte into the most significant portion. Then by shifting the first vector element by a first shift count and the second vector element by a second shift count, sub-byte elements are aligned to the least significant bits of their respective bytes. Processors then shuffle a byte from each of the shifted vector elements' least significant portions into byte positions of a destination vector element, and from each of the shifted vector elements' most significant portions into byte positions of another destination vector element.
Type: Grant
Filed: July 31, 2013
Date of Patent: August 2, 2016
Assignee: Intel Corporation
Inventors: Tal Uliel, Elmoustapha Ould-Ahmed-Vall, Thomas Willhalm, Robert Valentine
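The shuffle-and-shift pipeline above is hardware-specific, but the net effect — recovering sub-byte elements packed back-to-back in memory — can be sketched scalar-wise (assuming, for illustration, LSB-first packing):

```python
def unpack_sub_byte(data, width):
    """Extract `width`-bit elements packed contiguously (LSB first)
    in a byte string -- a scalar model of sub-byte decompression."""
    bits = 0
    nbits = 0
    out = []
    for byte in data:
        bits |= byte << nbits    # append the new byte above pending bits
        nbits += 8
        while nbits >= width:    # peel off complete width-bit elements
            out.append(bits & ((1 << width) - 1))
            bits >>= width
            nbits -= width
    return out

values = unpack_sub_byte(bytes([0b11011001]), 2)  # four 2-bit elements
```

The patented approach does the equivalent alignment for many elements at once, using byte shuffles to gather the right bytes and per-element shifts to bring each sub-byte field to a byte boundary.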
-
Patent number: 9367318
Abstract: Methods and systems are provided for managing thread execution in a processor. Multiple instructions are fetched from fetch queues. The instructions satisfy the condition that they involve fewer bits than the integer processing pathway that is used to execute them. The instructions are decoded, and divided into groups. The instructions are processed simultaneously through the pathway, such that part of the pathway is used to execute one group of instructions and another part of the pathway is used to execute another group of instructions. These parts are isolated from one another so the execution of the instructions can share the pathway and execute simultaneously and independently.
Type: Grant
Filed: November 3, 2015
Date of Patent: June 14, 2016
Assignee: Google Inc.
Inventor: James Laudon
-
Method of efficiently implementing a MPEG-4 AVC deblocking filter on an array of parallel processors
Patent number: 9369725
Abstract: A method for implementing a deblocking filter including the steps of (A) reading pixel values for a plurality of macroblocks of an unfiltered video frame from an input buffer into a working buffer, where the working buffer has dimensions determined by a predefined input region of the deblocking filter and a portion of the working buffer forms a filter output region of the deblocking filter, (B) sequentially processing the pixel values in the working buffer through a plurality of filter processing stages using an array of software-configurable general purpose parallel processors, where each of the plurality of filter processing stages operates on a respective set of the pixel values in the working buffer, and (C) writing filtered pixel values from the filter output region of the working buffer to an output buffer after the plurality of filter processing stages are completed.
Type: Grant
Filed: October 22, 2012
Date of Patent: June 14, 2016
Assignee: Amazon Technologies, Inc.
Inventor: Brian G. Lewis
-
Patent number: 9342334
Abstract: A system and method for simulating new instructions without compiler support for the new instructions. A simulator detects a given region in code generated by a compiler. The given region may be a candidate for vectorization or may be a region already vectorized. In response to the detection, the simulator suspends execution of a time-based simulation. The simulator then serially executes the region for at least two iterations using a functional-based simulation and using instructions with operands which correspond to P or less lanes of single-instruction-multiple-data (SIMD) execution. The value P is a maximum number of lanes of SIMD execution supported by the compiler. The simulator stores checkpoint state during the serial execution. In response to determining no inter-iteration memory dependencies exist, the simulator returns to the time-based simulation and resumes execution using N-wide vector instructions.
Type: Grant
Filed: June 22, 2012
Date of Patent: May 17, 2016
Assignee: Advanced Micro Devices, Inc.
Inventors: Bradford M. Beckmann, Nilay Vaish, Steven K. Reinhardt
-
Patent number: 9329870
Abstract: A method and circuit arrangement tightly couple together decode logic associated with multiple types of execution units and having varying priorities to enable instructions that are decoded as valid instructions for multiple types of execution units to be forwarded to a highest priority type of execution unit among the multiple types of execution units. Among other benefits, when an auxiliary execution unit is coupled to a general purpose processing core with the decode logic for the auxiliary execution unit tightly coupled with the decode logic for the general purpose processing core, the auxiliary execution unit may be used to effectively overlay new functionality for an existing instruction that is normally executed by the general purpose processing core, e.g., to patch a design flaw in the general purpose processing core or to provide improved performance for specialized applications.
Type: Grant
Filed: February 13, 2013
Date of Patent: May 3, 2016
Assignee: International Business Machines Corporation
Inventors: Adam J. Muff, Paul E. Schardt, Robert A. Shearer, Matthew R. Tubbs
-
Patent number: 9329671
Abstract: Computer system, method and computer program product for scheduling IPC activities are disclosed. In one embodiment, the computer system includes a first processor and a second processor that communicate with each other via IPC activities. The second processor may operate in a first mode in which the second processor is able to process IPC activities, or a second mode in which the second processor does not process IPC activities. Processing apparatus associated with the first processor identifies which of the pending IPC activities for communicating from the first processor to the second processor are not real-time sensitive, and schedules the identified IPC activities for communicating from the first processor to the second processor by delaying some of the identified IPC activities to thereby group them together. The grouped IPC activities are scheduled for communicating to the second processor during a period in which the second processor is continuously in the first mode.
Type: Grant
Filed: January 29, 2013
Date of Patent: May 3, 2016
Assignee: Nvidia Corporation
Inventors: Greg Heinrich, Philippe Guasch
-
Patent number: 9317296
Abstract: Methods, media, and computer systems are provided. The method includes, the media includes control logic for, and the computer system includes a processor with control logic for overriding an execution mask of SIMD hardware to enable at least one of a plurality of lanes of the SIMD hardware. Overriding the execution mask is responsive to a data parallel computation and a diverged control flow of a workgroup.
Type: Grant
Filed: December 21, 2012
Date of Patent: April 19, 2016
Assignee: ADVANCED MICRO DEVICES, INC.
Inventors: Timothy G. Rogers, Bradford M. Beckmann, James M. O'Connor
-
Patent number: 9311102
Abstract: Systems and methods to improve performance in a graphics processing unit are described herein. Embodiments achieve power saving in a graphics processing unit by dynamically activating/deactivating individual SIMDs in a shader complex that comprises multiple SIMD units. On-the-fly dynamic disabling and enabling of individual SIMDs provides flexibility in achieving a required performance and power level for a given processing application. Embodiments of the invention also achieve dynamic medium grain clock gating of SIMDs in a shader complex. Embodiments reduce switching power by shutting down clock trees to unused logic by providing a clock on demand mechanism. In this way, embodiments enhance clock gating to save more switching power for the duration of time when SIMDs are idle (or assigned no work). Embodiments can also save leakage power by power gating SIMDs for a duration when SIMDs are idle for an extended period of time.
Type: Grant
Filed: July 12, 2011
Date of Patent: April 12, 2016
Assignee: Advanced Micro Devices, Inc.
Inventors: Tushar K. Shah, Michael J. Mantor, Brian Emberling
-
Patent number: 9275984
Abstract: A multi-chip package system includes a signal transmission line commonly coupled to a plurality of semiconductor chips to transfer data to/from the semiconductor chips from/to outside; and a termination controller suitable for detecting a loading value of the signal transmission line and controlling a termination operation on the signal transmission line based on the loading value.
Type: Grant
Filed: July 5, 2013
Date of Patent: March 1, 2016
Assignee: SK Hynix Inc.
Inventor: Chun-Seok Jeong
-
Patent number: 9262704
Abstract: Methods and systems render higher bit per pixel contone images to lower bit formats using multiple registers of a SIMD processor. The rendering process uses a first register to maintain contone image values of all the pixels being simultaneously processed. A second register maintains a threshold value used during the conversion process. A third register maintains one value for the print ready format pixels (e.g., those having fewer bits per pixel), and a fourth register maintains the other value (e.g., 0) for the print ready format pixels. Also, a fifth register maintains the conversion error amount for all the pixels being simultaneously processed. Sixth through ninth registers maintain distributed conversion error amounts produced by the diffusing process (for different pixels being simultaneously processed); and a tenth register maintains the pixels in the print-ready format produced by the conversion for all the pixels being simultaneously processed.
Type: Grant
Filed: March 4, 2015
Date of Patent: February 16, 2016
Assignee: Xerox Corporation
Inventors: David Jon Metcalfe, Ryan David Metcalfe
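A minimal scalar model of the underlying conversion — thresholding with the quantization error carried into the next pixel — is sketched below. This is a simplified 1-D diffusion for illustration; the patent distributes the error across several registers and neighboring pixels, many pixels at a time:

```python
def threshold_with_error_diffusion(pixels, threshold=128):
    """Render 8-bit contone values to 1-bit output, diffusing each
    pixel's quantization error into the next pixel."""
    out = []
    error = 0
    for p in pixels:
        v = p + error                     # contone value plus carried error
        bit = 1 if v >= threshold else 0  # compare against the threshold
        out.append(bit)
        error = v - (255 if bit else 0)   # error left over after quantizing
    return out

bits = threshold_with_error_diffusion([200, 60, 120, 90])
```

The SIMD formulation in the abstract keeps the contone values, threshold, output levels, and error amounts each in their own vector register so the same per-pixel arithmetic runs across many pixels per instruction.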
-
Patent number: 9229721
Abstract: This disclosure is directed to techniques for executing subroutines in a single instruction, multiple data (SIMD) processing system that is subject to divergent thread conditions. In particular, a resume counter-based approach for managing divergent thread state is described that utilizes program module-specific minimum resume counters (MINRCs) for the efficient processing of control flow instructions. In some examples, the techniques of this disclosure may include using a main program MINRC to control the execution of a main program module and subroutine-specific MINRCs to control the execution of subroutine program modules. Techniques are also described for managing the main program MINRC and subroutine-specific MINRCs when subroutine call and return instructions are executed. Techniques are also described for updating a subroutine-specific MINRC to ensure that the updated MINRC value for the subroutine-specific MINRC is within the program space allocated for the subroutine.
Type: Grant
Filed: September 10, 2012
Date of Patent: January 5, 2016
Assignee: QUALCOMM Incorporated
Inventor: Lin Chen
-
Patent number: 9195675
Abstract: Embodiments provide methods and systems for encoding and decoding variable-length data, which may include methods for encoding and decoding search engine posting lists. Embodiments may include different encoding formats including group unary, packed unary, and/or packed binary formats. Some embodiments may utilize single instruction multiple data (SIMD) instructions that may perform a parallel shuffle operation on encoded data as part of the decoding processes. Some embodiments may utilize lookup tables to determine shuffle sequences and/or masks and/or shifts to be utilized in the decoding processes. Some embodiments may utilize hybrid formats.
Type: Grant
Filed: March 31, 2011
Date of Patent: November 24, 2015
Assignee: A9.com, Inc.
Inventors: Daniel E. Rose, Alexander A. Stepanov, Anil Ramesh Gangolli, Paramjit S. Oberoi, Ryan Jacob Ernst
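The patented group formats differ, but the family of variable-length integer encodings such decoders accelerate can be illustrated with a plain LEB128-style decoder (continuation flag in the high bit, seven payload bits per byte) — the SIMD versions replace this byte-at-a-time loop with table-driven shuffles:

```python
def decode_varint_stream(data):
    """Decode a stream of LEB128-style variable-length integers:
    each byte contributes 7 payload bits; a set high bit means
    more bytes follow for the current integer."""
    out, value, shift = [], 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift   # accumulate payload bits
        if byte & 0x80:                   # continuation: keep going
            shift += 7
        else:                             # final byte of this integer
            out.append(value)
            value, shift = 0, 0
    return out

nums = decode_varint_stream(bytes([0x05, 0xAC, 0x02]))
```

Because each encoded integer's length is data-dependent, SIMD decoding hinges on a lookup table keyed by the continuation bits that yields the shuffle pattern, masks, and shifts for a whole group at once.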
-
Patent number: 9189828
Abstract: An accelerator system is implemented on an expansion card comprising a printed circuit board having (a) one or more graphics processing units (GPUs), (b) two or more associated memory banks (logically or physically partitioned), (c) a specialized controller, and (d) a local bus providing signal coupling compatible with the PCI industry standards. The controller handles most of the primitive operations to set up and control GPU computation. Thus, the computer's central processing unit (CPU) can be dedicated to other tasks. In this case a few controls (simulation start and stop signals from the CPU and the simulation completion signal back to the CPU), GPU programs and input/output data are exchanged between the CPU and the expansion card. Moreover, since on every time step of the simulation the results from the previous time step are used but not changed, the results are preferably transferred back to the CPU in parallel with the computation.
Type: Grant
Filed: January 3, 2014
Date of Patent: November 17, 2015
Assignee: Neurala, Inc.
Inventors: Anatoli Gorchetchnikov, Heather Marie Ames, Massimiliano Versace, Fabrizio Santini
-
Patent number: 9183907
Abstract: One or more techniques for improving Vccmin for a dual port synchronous random access memory (DPSRAM) cell utilized as a single port synchronous random access memory (SPSRAM) cell are provided herein. In some embodiments, a second word line signal is sent to a second word line of the DPSRAM cell. For example, the second word line signal is sent in response to a logical low at a first bit line or a logical low at a second bit line. In this way, Vccmin is improved for the DPSRAM cell.
Type: Grant
Filed: November 28, 2012
Date of Patent: November 10, 2015
Assignee: Taiwan Semiconductor Manufacturing Company Limited
Inventors: Ching-Wei Wu, Cheng Hung Lee, Chia-Cheng Chen
-
Patent number: 9164770
Abstract: There is provided a method of performing single instruction multiple data (SIMD) operations. The method comprises storing a plurality of arrays in memory for performing SIMD operations thereon; determining a total number of SIMD operations to be performed on the plurality of arrays; loading a counter with the total number of SIMD operations to be performed on the plurality of arrays; enabling a plurality of arithmetic logic units (ALUs) to perform a first number of operations on first elements of the plurality of arrays; performing the first number of operations on first elements of the plurality of arrays using the plurality of ALUs; decrementing the counter by the first number of operations to provide a remaining number of operations; and enabling a number of the plurality of ALUs to perform the remaining number of operations on second elements of the plurality of arrays.
Type: Grant
Filed: November 23, 2009
Date of Patent: October 20, 2015
Assignee: Mindspeed Technologies, Inc.
Inventor: Patrick D. Ryan
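The counter-driven scheme above can be sketched in software (a model with a hypothetical 4-lane machine, not the claimed hardware): full-width passes run while the counter exceeds the lane count, and the final pass enables only as many ALUs as operations remain:

```python
def run_simd_ops(arrays, lanes=4):
    """Element-wise add of two arrays, driven by a counter of the
    total operations, enabling only `remaining` lanes on the last pass."""
    a, b = arrays
    total = len(a)                # counter loaded with total operations
    out = [0] * total
    i = 0
    remaining = total
    while remaining > 0:
        enabled = min(lanes, remaining)   # ALUs enabled for this pass
        for lane in range(enabled):
            out[i + lane] = a[i + lane] + b[i + lane]
        i += enabled
        remaining -= enabled              # decrement the counter
    return out

result = run_simd_ops(([1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60]))
```

Disabling the surplus ALUs on the tail pass avoids out-of-bounds work (and, in hardware, wasted power) when the array length is not a multiple of the SIMD width.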
-
Patent number: 9159276
Abstract: According to one embodiment of the present invention, a method for creating bit planes from frame data for a digital mirror device is disclosed including forming data elements comprising bits of equal significance from a plurality of pixel data in the frame data, the forming including using dual index direct memory access operations.
Type: Grant
Filed: December 20, 2007
Date of Patent: October 13, 2015
Assignee: TEXAS INSTRUMENTS INCORPORATED
Inventors: James N. Malina, Leonardo W. Estevez, Gunter Schmer
-
Patent number: 9147470
Abstract: Apparatus for programming a non-volatile memory, the apparatus having corresponding methods and tangible computer-readable media, comprise: a command memory configured to hold a plurality of command templates, wherein each of the command templates specifies a sequence of pad signals; a state machine configured to i) receive descriptors, wherein each of the descriptors includes a pointer to a respective one of the command templates in the command memory, and ii) generate the sequence of pad signals based on the command template indicated by the respective pointer; and a non-volatile memory interface configured to provide, to pads of the non-volatile memory, the sequence of pad signals generated by the state machine.
Type: Grant
Filed: November 12, 2012
Date of Patent: September 29, 2015
Assignee: MARVELL INTERNATIONAL LTD.
Inventors: Chih-Ching Chen, Hyunsuk Shin, Chi Kong Lee, Xueting Yu
-
Patent number: 9092227
Abstract: A vector slot processor that is capable of supporting multiple signal processing operations for multiple demodulation standards is provided. The vector slot processor includes a plurality of micro execution slots (MES) that perform the multiple signal processing operations on the high speed streaming inputs. Each of the MES includes one or more n-way signal registers that receive the high speed streaming inputs, one or more n-way coefficient registers that store filter coefficients for the multiple signal processing, and one or more n-way Multiply and Accumulate (MAC) units that receive the high speed streaming inputs from the one or more n-way signal registers and filter coefficients from one or more n-way coefficient registers. The one or more n-way MAC units perform a vertical MAC operation and a horizontal multiply and add operation on the high speed streaming inputs.
Type: Grant
Filed: May 2, 2012
Date of Patent: July 28, 2015
Inventors: Anindya Saha, Gururaj Padaki, Santosh Billava, Rakesh A. Joshi
-
Patent number: 9086872
Abstract: Receiving an instruction indicating first and second operands. Each of the operands having packed data elements that correspond in respective positions. A first subset of the data elements of the first operand and a first subset of the data elements of the second operand each corresponding to a first lane. A second subset of the data elements of the first operand and a second subset of the data elements of the second operand each corresponding to a second lane. Storing result, in response to instruction, including: (1) in first lane, only lowest order data elements from first subset of first operand interleaved with corresponding lowest order data elements from first subset of second operand; and (2) in second lane, only highest order data elements from second subset of first operand interleaved with corresponding highest order data elements from second subset of second operand.
Type: Grant
Filed: June 30, 2009
Date of Patent: July 21, 2015
Assignee: Intel Corporation
Inventors: Asaf Hargil, Doron Orenstein
-
Patent number: 9081562
Abstract: Receiving an instruction indicating first and second operands. Each of the operands having packed data elements that correspond in respective positions. A first subset of the data elements of the first operand and a first subset of the data elements of the second operand each corresponding to a first lane. A second subset of the data elements of the first operand and a second subset of the data elements of the second operand each corresponding to a second lane. Storing result, in response to instruction, including: (1) in first lane, only lowest order data elements from first subset of first operand interleaved with corresponding lowest order data elements from first subset of second operand; and (2) in second lane, only highest order data elements from second subset of first operand interleaved with corresponding highest order data elements from second subset of second operand.
Type: Grant
Filed: March 15, 2013
Date of Patent: July 14, 2015
Assignee: Intel Corporation
Inventors: Asaf Hargil, Doron Orenstein
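With hypothetical 8-element operands split into two 4-element lanes (index 0 taken as the lowest-order element), the interleave described in the abstract above can be modeled as a sketch:

```python
def unpack_interleave(a, b, lane_width=4):
    """Low result lane: interleave the low halves of each operand's low
    lane. High result lane: interleave the high halves of each operand's
    high lane. A simplified model of the described instruction."""
    half = lane_width // 2
    a_lo, a_hi = a[:lane_width], a[lane_width:]
    b_lo, b_hi = b[:lane_width], b[lane_width:]
    low_lane = [x for pair in zip(a_lo[:half], b_lo[:half]) for x in pair]
    high_lane = [x for pair in zip(a_hi[half:], b_hi[half:]) for x in pair]
    return low_lane + high_lane

res = unpack_interleave([0, 1, 2, 3, 4, 5, 6, 7],
                        [10, 11, 12, 13, 14, 15, 16, 17])
```

Per-lane interleaves like this let wide registers behave as several independent narrower registers, which keeps code written for the narrower width correct after widening.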
-
Publication number: 20150127924
Abstract: A method and corresponding apparatus for processing a shuffle instruction are provided. Shuffle units are configured in a hierarchical structure, and each of the shuffle units generates a shuffled data element array by performing shuffling on an input data element array. In the hierarchical structure, which includes an upper shuffle unit and a lower shuffle unit, the shuffled data element array output from the lower shuffle unit is input to the upper shuffle unit as a portion of the input data element array for the upper shuffle unit.
Type: Application
Filed: July 14, 2014
Publication date: May 7, 2015
Applicant: SAMSUNG ELECTRONICS CO., LTD.
Inventors: Keshava PRASAD, Navneet BASUTKAR, Young Hwan PARK, Ho YANG, Yeon Bok LEE
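A two-level sketch of that hierarchy (illustrative only; patterns and widths are hypothetical): the lower unit shuffles part of the data, and its output feeds the upper unit as a portion of the upper unit's input:

```python
def shuffle(data, pattern):
    """One shuffle unit: output[i] = input[pattern[i]]."""
    return [data[p] for p in pattern]

def hierarchical_shuffle(data, lower_pattern, upper_pattern):
    """The lower unit shuffles the first len(lower_pattern) elements;
    its output becomes part of the upper unit's input array."""
    lower_out = shuffle(data[:len(lower_pattern)], lower_pattern)
    upper_in = lower_out + data[len(lower_pattern):]
    return shuffle(upper_in, upper_pattern)

out = hierarchical_shuffle([1, 2, 3, 4], [1, 0], [2, 3, 0, 1])
```

Composing small shuffle units this way can realize wide permutations without a single full-width crossbar.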
-
Publication number: 20150100758
Abstract: A data processor includes a register file divided into at least a first portion and a second portion for storing data. A single instruction, multiple data (SIMD) unit is also divided into at least a first lane and a second lane. The first and second lanes of the SIMD unit correspond respectively to the first and second portions of the register file. Furthermore, each lane of the SIMD unit is capable of data processing. The data processor also includes a realignment element in communication with the register file and the SIMD unit. The realignment element is configured to selectively realign conveyance of data between the first portion of the register file and the first lane of the SIMD unit to the second lane of the SIMD unit.
Type: Application
Filed: October 3, 2013
Publication date: April 9, 2015
Applicant: ADVANCED MICRO DEVICES, INC.
Inventors: Timothy G. Rogers, Bradford M. Beckmann, James M. O'Connor
-
Patent number: 8996845
Abstract: A vector compare-and-exchange operation is performed by: decoding by a decoder in a processing device, a single instruction specifying a vector compare-and-exchange operation for a plurality of data elements between a first storage location, a second storage location, and a third storage location; issuing the single instruction for execution by an execution unit in the processing device; and responsive to the execution of the single instruction, comparing data elements from the first storage location to corresponding data elements in the second storage location; and responsive to determining a match exists, replacing the data elements from the first storage location with corresponding data elements from the third storage location.
Type: Grant
Filed: December 22, 2009
Date of Patent: March 31, 2015
Assignee: Intel Corporation
Inventors: Ravi Rajwar, Andrew T. Forsyth
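Element-wise, the operation reads as a vectorized compare-and-exchange, which a short model can capture (a sketch; the real instruction also operates atomically on its storage locations):

```python
def vector_compare_exchange(dest, expected, replacement):
    """Wherever `dest` matches `expected`, take the corresponding
    element from `replacement`; otherwise leave `dest` unchanged."""
    return [r if d == e else d
            for d, e, r in zip(dest, expected, replacement)]

out = vector_compare_exchange([1, 5, 3, 7],   # first storage location
                              [1, 2, 3, 4],   # second: expected values
                              [9, 9, 9, 9])   # third: replacement values
```

This generalizes the scalar compare-and-exchange primitive used in lock-free code to many data elements per instruction.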
-
Patent number: 8959319
Abstract: Embodiments of the present invention provide systems, methods, and computer program products for improving divergent conditional branches in code being executed by a processor. For example, in an embodiment, a method comprises detecting a conditional statement of a program being simultaneously executed by a plurality of threads, determining which threads evaluate a condition of the conditional statement as true and which threads evaluate the condition as false, pushing an identifier associated with the larger set of the threads onto a stack, executing code associated with a smaller set of the threads, and executing code associated with the larger set of the threads.
Type: Grant
Filed: December 2, 2011
Date of Patent: February 17, 2015
Assignee: Advanced Micro Devices, Inc.
Inventors: Mark Leather, Norman Rubin, Brian D. Emberling
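The scheduling decision in that abstract — split by the condition, stack the larger set, run the smaller set first — can be modeled in a few lines (a sketch of the ordering policy only, not the hardware reconvergence machinery):

```python
def schedule_divergent(threads, cond):
    """Split threads by a branch condition; the larger set is pushed
    on a stack and the smaller set's path executes first."""
    true_set = [t for t in threads if cond(t)]
    false_set = [t for t in threads if not cond(t)]
    smaller, larger = sorted([true_set, false_set], key=len)
    stack = [larger]          # identifier of the larger set is pushed
    first = smaller           # smaller set's path executes first
    second = stack.pop()      # then the larger set's path
    return first, second

first, second = schedule_divergent(range(8), lambda t: t % 4 == 0)
```

Running the smaller set first gets the cheap path out of the way, so the bulk of the threads resume full-width execution sooner.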
-
Patent number: 8954943
Abstract: A method for analyzing data reordering operations in Single Instruction Multiple Data (SIMD) source code and generating executable code therefrom is provided. Input is received. One or more data reordering operations in the input are identified and each data reordering operation in the input is abstracted into a corresponding virtual shuffle operation so that each virtual shuffle operation forms part of an expression tree. One or more virtual shuffle trees are collapsed by combining virtual shuffle operations within at least one of the one or more virtual shuffle trees to form one or more combined virtual shuffle operations, wherein each virtual shuffle tree is a subtree of the expression tree that only contains virtual shuffle operations. Then code is generated for the one or more combined virtual shuffle operations.
Type: Grant
Filed: January 26, 2006
Date of Patent: February 10, 2015
Assignee: International Business Machines Corporation
Inventors: Alexandre E. Eichenberger, Kai-Ting Amy Wang, Peng Wu, Peng Zhao
-
Publication number: 20150019838
Abstract: A method of loading and duplicating scalar data from a source into a destination register. The data may be duplicated in byte, half word, word, or double word parts, according to a duplication pattern.
Type: Application
Filed: July 9, 2014
Publication date: January 15, 2015
Inventors: Timothy David Anderson, Duc Quang Bui, Peter Richard Dent
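The duplication patterns this publication describes can be illustrated with a short Python sketch (the function name and the 16-byte register width are assumptions for illustration):

```python
def load_dup(scalar_bytes, part_size, reg_size=16):
    """Broadcast a scalar of part_size bytes (1 = byte, 2 = half word,
    4 = word, 8 = double word) across a reg_size-byte destination
    register by repeating it reg_size // part_size times."""
    assert len(scalar_bytes) == part_size and reg_size % part_size == 0
    return scalar_bytes * (reg_size // part_size)

# Duplicate a 4-byte word four times to fill a 16-byte register.
load_dup(b"\x01\x02\x03\x04", 4)
```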
-
Publication number: 20150012724
Abstract: A data processing apparatus has permutation circuitry for performing a permutation operation for changing a data element size or data element positioning of at least one source operand to generate first and second SIMD operands, and SIMD processing circuitry for performing a SIMD operation on the first and second SIMD operands. In response to a first SIMD instruction requiring a permutation operation, the instruction decoder controls the permutation circuitry to perform the permutation operation to generate the first and second SIMD operands and then controls the SIMD processing circuitry to perform the SIMD operation using these operands. In response to a second SIMD instruction not requiring a permutation operation, the instruction decoder controls the SIMD processing circuitry to perform the SIMD operation using the first and second SIMD operands identified by the instruction, without passing them via the permutation circuitry.
Type: Application
Filed: July 8, 2013
Publication date: January 8, 2015
Inventors: David Raymond Lutz, Neil Burgess
-
Patent number: 8918553
Abstract: A mechanism for programming a direct memory access engine operating as a multithreaded processor is provided. A plurality of programs is received from a host processor into a local memory associated with the direct memory access engine. A request is received in the direct memory access engine from the host processor indicating that the plurality of programs located in the local memory is to be executed. The direct memory access engine executes two or more of the plurality of programs without intervention by the host processor. As each of the two or more programs completes execution, the direct memory access engine sends a completion notification to the host processor indicating that the program has completed execution.
Type: Grant
Filed: June 5, 2012
Date of Patent: December 23, 2014
Assignee: International Business Machines Corporation
Inventors: Brian K. Flachs, Harm P. Hofstee, Charles R. Johns, Matthew E. King, John S. Liberty, Brad W. Michael
-
Patent number: 8914613
Abstract: In-lane vector shuffle operations are described. In one embodiment a shuffle instruction specifies a field of per-lane control bits, a source operand and a destination operand, these operands having corresponding lanes, each lane divided into corresponding portions of multiple data elements. Sets of data elements are selected from corresponding portions of every lane of the source operand according to per-lane control bits. Elements of these sets are copied to specified fields in corresponding portions of every lane of the destination operand. Another embodiment of the shuffle instruction also specifies a second source operand, all operands having corresponding lanes divided into multiple data elements. A set selected according to per-lane control bits contains data elements from every lane portion of a first source operand and data elements from every corresponding lane portion of the second source operand. Set elements are copied to specified fields in every lane of the destination operand.
Type: Grant
Filed: August 26, 2011
Date of Patent: December 16, 2014
Assignee: Intel Corporation
Inventors: Zeev Sperber, Robert Valentine, Benny Eitan, Doron Orenstein
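The "in-lane" constraint means every destination element is selected from the same lane of the source, so lanes never exchange data. A hedged Python sketch with illustrative names:

```python
def in_lane_shuffle(src, control, lane_elems=4):
    """Per-lane shuffle: split src into lanes of lane_elems elements;
    destination slot i of each lane receives the element selected by
    control[i] from the SAME lane of the source."""
    assert len(src) % lane_elems == 0 and len(control) == lane_elems
    out = []
    for base in range(0, len(src), lane_elems):
        lane = src[base:base + lane_elems]
        out.extend(lane[c] for c in control)
    return out

# Reverse each 4-element lane of a two-lane vector.
in_lane_shuffle([0, 1, 2, 3, 4, 5, 6, 7], [3, 2, 1, 0])  # → [3, 2, 1, 0, 7, 6, 5, 4]
```

Keeping selection within a lane is what lets hardware implement such shuffles without cross-lane wiring.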
-
Patent number: 8898432
Abstract: Systems and methods for folding a single instruction multiple data (SIMD) array include a newly defined processing element group (PEG) that allows interconnection of PEGs by abutment without requiring a row or column weave pattern. The interconnected PEGs form a SIMD array that is effectively folded at its center along the North-South axis, and may also be folded along the East-West axis. The folding of the array provides for north and south boundaries to be co-located and for east and west boundaries to be co-located. The co-location allows wrap-around connections to be done with a propagation distance reduced effectively to zero.
Type: Grant
Filed: October 25, 2011
Date of Patent: November 25, 2014
Assignee: Geo Semiconductor, Inc.
Inventor: Woodrow L. Meeker
-
Patent number: 8892781
Abstract: A computer program product, apparatus, and a method for facilitating input/output (I/O) processing for an I/O operation at a host computer system configured for communication with a control unit. The method includes receiving a command block from the channel subsystem, the command block including at least one input command and at least one output command specified by a transport command word (TCW) and associated with the I/O operation, the I/O operation having both input and output data, the TCW specifying a location in the memory of the output data and a location in the memory for storing the input data; receiving the output data specified by the TCW and executing the at least one output command; and forwarding the input data specified by the TCW to the channel subsystem for storage at a location specified by the TCW.
Type: Grant
Filed: June 13, 2013
Date of Patent: November 18, 2014
Assignee: International Business Machines Corporation
Inventors: John R. Flanagan, Daniel F. Casper, Catherine C. Huang, Matthew J. Kalos, Ugochukwu C. Njoku, Dale F. Riedy, Gustav E. Sittmann, III
-
Patent number: 8862827
Abstract: A cache manager receives a request for data, which includes a requested effective address. The cache manager determines whether the requested effective address matches a most recently used effective address stored in a mapped tag vector. When the most recently used effective address matches the requested effective address, the cache manager identifies a corresponding cache location and retrieves the data from the identified cache location. However, when the most recently used effective address fails to match the requested effective address, the cache manager determines whether the requested effective address matches a subsequent effective address stored in the mapped tag vector. When the cache manager determines a match to a subsequent effective address, the cache manager identifies a different cache location corresponding to the subsequent effective address and retrieves the data from the different cache location.
Type: Grant
Filed: December 29, 2009
Date of Patent: October 14, 2014
Assignee: International Business Machines Corporation
Inventors: Brian Flachs, Barry L. Minor, Mark Richard Nutter
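The MRU-first lookup can be sketched in Python; the tag-vector representation here (a list of (effective address, cache location) pairs, most recently used first) is an assumption for illustration, not the patent's data layout:

```python
def lookup(cache, tag_vector, requested_ea):
    """Check the most recently used tag entry first, then fall back to
    scanning the subsequent entries of the mapped tag vector."""
    if tag_vector and tag_vector[0][0] == requested_ea:   # MRU hit
        return cache[tag_vector[0][1]]
    for ea, loc in tag_vector[1:]:                        # subsequent entries
        if ea == requested_ea:
            return cache[loc]
    return None                                           # miss
```

Checking the MRU entry first pays off when accesses exhibit temporal locality, since the common case resolves after a single comparison.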
-
Patent number: 8856494
Abstract: Data processing circuit containing an instruction execution circuit having an instruction set comprising a SIMD instruction. The instruction execution circuit comprises arithmetic circuits, arranged to perform N respective identical operations in parallel in response to the SIMD instruction. The SIMD instruction selects a first one and a second one of the registers. The SIMD instruction defines a first and second series of N respective SIMD instruction operands of the SIMD instruction from the addressed registers. Each arithmetic circuit receives a respective first operand and a respective second operand from the first and second series respectively. The instruction execution circuit selects the first and second series so they partially overlap. Positioning the operands is under program control.
Type: Grant
Filed: January 11, 2012
Date of Patent: October 7, 2014
Assignee: Intel Corporation
Inventor: Antonius A. M. Van Wel
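A Python sketch of the overlapping-series idea (illustrative names; the register file is modeled as a flat list and the series positions are program-supplied, as the abstract describes):

```python
def overlapping_simd(registers, start1, start2, n, op):
    """Run n identical operations in parallel on two operand series
    drawn from the same register file; the series may partially
    overlap, with their positions under program control."""
    first = registers[start1:start1 + n]
    second = registers[start2:start2 + n]
    return [op(a, b) for a, b in zip(first, second)]

# Series [1, 2, 3, 4] and [2, 3, 4, 5] overlap in three elements.
overlapping_simd([1, 2, 3, 4, 5], 0, 1, 4, lambda a, b: a + b)  # → [3, 5, 7, 9]
```

Overlapping series are useful for sliding-window computations such as filters, where adjacent outputs reuse most of their inputs.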
-
Patent number: 8838946
Abstract: An apparatus includes an instruction decoder, first and second source registers and a circuit coupled to the decoder to receive packed data from the source registers and to unpack the packed data responsive to an unpack instruction received by the decoder. A first packed data element and a third packed data element are received from the first source register. A second packed data element and a fourth packed data element are received from the second source register. The circuit copies the packed data elements into a destination register resulting with the second packed data element adjacent to the first packed data element, the third packed data element adjacent to the second packed data element, and the fourth packed data element adjacent to the third packed data element.
Type: Grant
Filed: December 29, 2012
Date of Patent: September 16, 2014
Assignee: Intel Corporation
Inventors: Alexander Peleg, Yaakov Yaari, Millind Mittal, Larry M. Mennemeier, Benny Eitan
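The element order described (odd-numbered elements from the first source, even-numbered from the second, each adjacent to its counterpart) is a classic interleave, which a short Python sketch makes concrete (names are illustrative):

```python
def unpack(src1, src2):
    """Interleave corresponding elements of two source registers so
    each element of src2 lands adjacent to its counterpart from src1:
    [src1[0], src2[0], src1[1], src2[1], ...]."""
    out = []
    for a, b in zip(src1, src2):
        out.extend((a, b))
    return out

unpack([1, 3, 5, 7], [2, 4, 6, 8])  # → [1, 2, 3, 4, 5, 6, 7, 8]
```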
-
Patent number: 8832417
Abstract: This disclosure describes techniques for handling divergent thread conditions in a multi-threaded processing system. In some examples, a control flow unit may obtain a control flow instruction identified by a program counter value stored in a program counter register. The control flow instruction may include a target value indicative of a target program counter value for the control flow instruction. The control flow unit may select one of the target program counter value and a minimum resume counter value as a value to load into the program counter register. The minimum resume counter value may be indicative of a smallest resume counter value from a set of one or more resume counter values associated with one or more inactive threads. Each of the one or more resume counter values may be indicative of a program counter value at which a respective inactive thread should be activated.
Type: Grant
Filed: September 7, 2011
Date of Patent: September 9, 2014
Assignee: QUALCOMM Incorporated
Inventors: Lin Chen, David Rigel Garcia Garcia, Andrew E. Gruber, Guofang Jiao
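The abstract leaves the selection criterion open; one plausible rule, sketched here under the assumption that the smaller program counter value wins so the earliest-resuming inactive thread is not skipped:

```python
def next_pc(target_pc, resume_counters):
    """Select the value loaded into the program counter register: the
    branch target, or the minimum resume counter of the inactive
    threads, whichever is smaller (assumed selection rule)."""
    if resume_counters:
        return min(target_pc, min(resume_counters))
    return target_pc

next_pc(100, [50, 120])  # → 50: an inactive thread resumes before the branch target
```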
-
Patent number: 8825987
Abstract: Methods, apparatus, and instructions for performing string comparison operations. An apparatus may include execution resources to execute a first instruction. In response to the first instruction, said execution resources store a result of a comparison between each data element of a first and second operand corresponding to a first and second text string, respectively.
Type: Grant
Filed: December 20, 2012
Date of Patent: September 2, 2014
Assignee: Intel Corporation
Inventors: Michael A. Julier, Jeffrey D. Gray, Srinivas Chennupaty, Sean P. Mirkes, Mark P. Seconi
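The per-element result the abstract describes can be sketched in Python (illustrative only; a real string-compare instruction evaluates all packed characters in a single operation):

```python
def compare_strings(op1, op2):
    """Per-element comparison of two text-string operands: result[i]
    is True where the corresponding data elements (characters) match."""
    return [a == b for a, b in zip(op1, op2)]

compare_strings("abcd", "abed")  # → [True, True, False, True]
```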
-
Patent number: 8819394
Abstract: Methods, apparatus, and instructions for performing string comparison operations. In one embodiment, an apparatus includes execution resources to execute a first instruction. In response to the first instruction, said execution resources store a result of a comparison between each data element of a first and second operand corresponding to a first and second text string, respectively.
Type: Grant
Filed: December 20, 2012
Date of Patent: August 26, 2014
Assignee: Intel Corporation
Inventors: Michael A. Julier, Jeffrey D. Gray, Srinivas Chennupaty, Sean P. Mirkes, Mark P. Seconi
-
Publication number: 20140208068
Abstract: Compression and decompression of numerical data utilizing single instruction, multiple data (SIMD) instructions is described. The numerical data includes integer and floating-point samples. Compression supports three encoding modes: lossless, fixed-rate, and fixed-quality. SIMD instructions for compression operations may include attenuation, derivative calculations, bit packing to form compressed packets, header generation for the packets, and packed array output operations. SIMD instructions for decompression may include packed array input operations, header recovery, decoder control, bit unpacking, integration, and amplification. Compression and decompression may be implemented in a microprocessor, digital signal processor, field-programmable gate array, application-specific integrated circuit, system-on-chip, or graphics processor, using SIMD instructions. Compression and decompression of numerical data can reduce memory, networking, and storage bottlenecks.
Type: Application
Filed: January 22, 2013
Publication date: July 24, 2014
Applicant: Samplify Systems, Inc.
Inventor: Albert W. Wegener
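Two of the compression stages named in the abstract, attenuation and derivative calculation, can be sketched on integer samples. This is a hedged illustration, not Samplify's encoder; the shift amount and sample values are arbitrary:

```python
def compress_samples(samples, atten_shift):
    """Attenuate (arithmetic right shift) and then take the first
    derivative, shrinking the magnitude of slowly varying samples so
    the subsequent bit-packing stage needs fewer bits per sample."""
    att = [s >> atten_shift for s in samples]
    return [att[0]] + [att[i] - att[i - 1] for i in range(1, len(att))]

compress_samples([8, 12, 16, 16], 2)  # → [2, 1, 1, 0]
```

Decompression reverses the stages: integration (cumulative sum) undoes the derivative, and amplification undoes the attenuation.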
-
Publication number: 20140208069
Abstract: An execution unit configured for compression and decompression of numerical data utilizing single instruction, multiple data (SIMD) instructions is described. The numerical data includes integer and floating-point samples. Compression supports three encoding modes: lossless, fixed-rate, and fixed-quality. SIMD instructions for compression operations may include attenuation, derivative calculations, bit packing to form compressed packets, header generation for the packets, and packed array output operations. SIMD instructions for decompression may include packed array input operations, header recovery, decoder control, bit unpacking, integration, and amplification. Compression and decompression may be implemented in a microprocessor, digital signal processor, field-programmable gate array, application-specific integrated circuit, system-on-chip, or graphics processor, using SIMD instructions. Compression and decompression of numerical data can reduce memory, networking, and storage bottlenecks.
Type: Application
Filed: January 22, 2013
Publication date: July 24, 2014
Applicant: Samplify Systems, Inc.
Inventor: Albert W. Wegener
-
Publication number: 20140201498
Abstract: Instructions and logic provide vector scatter-op and/or gather-op functionality. In some embodiments, responsive to an instruction specifying: a gather and a second operation, a destination register, an operand register, and a memory address; execution units read values in a mask register, wherein fields in the mask register correspond to offset indices in the indices register for data elements in memory. A first mask value indicates the element has not been gathered from memory and a second value indicates that the element does not need to be, or has already been, gathered. For each field having the first value, the data element is gathered from memory into the corresponding destination register location, and the corresponding value in the mask register is changed to the second value. When all mask register fields have the second value, the second operation is performed using corresponding data in the destination and operand registers to generate results.
Type: Application
Filed: September 26, 2011
Publication date: July 17, 2014
Applicant: Intel Corporation
Inventors: Elmoustapha Ould-Ahmed-Vall, Kshitij A. Doshi, Charles R. Yount, Suleyman Sair
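The mask-driven gather and its restartability can be simulated in Python (names are illustrative; memory is modeled as a flat list):

```python
def masked_gather(memory, base, indices, mask, dest):
    """For each mask field still holding the 'not yet gathered' value
    (True), load memory[base + index] into dest and flip the field to
    'done' (False). Because completed elements are marked in the mask,
    a gather interrupted by a fault can simply be re-issued and will
    only fetch the elements still pending."""
    for i, pending in enumerate(mask):
        if pending:
            dest[i] = memory[base + indices[i]]
            mask[i] = False
    return dest, mask
```

Once every mask field reads 'done', the second operation of the gather-op can proceed on the destination and operand registers.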
-
Publication number: 20140195778
Abstract: Instructions and logic provide vector load-op and/or store-op with stride functionality. In some embodiments, responsive to an instruction specifying: a set of loads, a second operation, a destination register, an operand register, a memory address, and a stride length; execution units read values in a mask register, wherein fields in the mask register correspond to stride-length multiples from the memory address to data elements in memory. A first mask value indicates the element has not been loaded from memory and a second value indicates that the element does not need to be, or has already been, loaded. For each field having the first value, the data element is loaded from memory into the corresponding destination register location, and the corresponding value in the mask register is changed to the second value. Then the second operation is performed using corresponding data in the destination and operand registers to generate results. The instruction may be restarted after faults.
Type: Application
Filed: September 26, 2011
Publication date: July 10, 2014
Applicant: Intel Corporation
Inventors: Elmoustapha Ould-Ahmed-Vall, Kshitij A. Doshi, Suleyman Sair, Charles R. Yount
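The load-with-stride plus second-operation flow can be simulated in Python (illustrative names; memory is modeled as a flat list and the second operation as a callable):

```python
def strided_load_op(memory, addr, stride, mask, dest, operand, op):
    """Load each pending element (mask True) from addr + i * stride and
    mark it done; once nothing is pending, apply the second operation
    to the destination and operand registers to generate results."""
    for i, pending in enumerate(mask):
        if pending:
            dest[i] = memory[addr + i * stride]
            mask[i] = False
    if not any(mask):
        return [op(d, o) for d, o in zip(dest, operand)]
    return dest  # a fault left elements pending; restart the instruction

strided_load_op(list(range(32)), 0, 4, [True] * 4, [0] * 4, [1] * 4,
                lambda a, b: a + b)  # → [1, 5, 9, 13]
```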
-
Patent number: 8775147
Abstract: An algorithm and architecture are disclosed for performing multi-argument associative operations. The algorithm and architecture can be used to schedule operations on multiple facilities for computations or can be used in the development of a model in a modeling environment. The algorithm, and the architecture that results from it, use the latency of the components that process the associative operations. The algorithm minimizes the number of components necessary to produce an output of multi-argument associative operations and can also minimize the number of inputs each component receives.
Type: Grant
Filed: May 31, 2006
Date of Patent: July 8, 2014
Assignee: The MathWorks, Inc.
Inventors: Alireza Pakyari, Brian K. Ogilvie