Single Instruction, Multiple Data (SIMD) Patents (Class 712/22)
-
Patent number: 9582419
Abstract: A data processing device 100 comprises a plurality of storage circuits 130, 160, which store a plurality of data elements of b bits in an interleaved manner. The data processing device also comprises a consumer 110 with a number of lanes 120. The consumer is able to individually access each of the plurality of storage circuits 130, 160 in order to receive into the lanes 120 either a subset of the plurality of data elements or y bits of each of the plurality of data elements. The consumer 110 is also able to execute a common instruction on each of the plurality of lanes 120. The relationship between b and y is such that b is greater than y and is an integer multiple of y. Each of the plurality of storage circuits 130, 160 stores at most y bits of each of the data elements. Furthermore, each of the storage circuits 130, 160 stores at most y/b of the plurality of data elements. By carrying out the interleaving in this manner, the plurality of storage circuits 130, 160 comprise no more than b/y storage circuits.
Type: Grant
Filed: October 25, 2013
Date of Patent: February 28, 2017
Assignee: ARM Limited
Inventors: Ganesh Suryanarayan Dasika, Rune Holm, Stephen John Hill
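As a rough software sketch of the interleaving described above (not the claimed hardware), assume hypothetical parameters b = 32 and y = 8: each 32-bit element is sliced into 8-bit pieces, and slice i of every element lands in storage circuit i, giving b/y = 4 circuits, each holding at most y bits of any one element:

```python
def interleave_store(elements, b=32, y=8):
    # The abstract's constraint: b is greater than y and an integer multiple of y.
    assert b > y and b % y == 0
    n_circuits = b // y              # no more than b/y storage circuits
    mask = (1 << y) - 1
    # Circuit i receives bits [i*y, (i+1)*y) of every element.
    return [[(e >> (y * i)) & mask for e in elements]
            for i in range(n_circuits)]

circuits = interleave_store([0x11223344, 0xAABBCCDD])
```

A consumer lane can then read either one y-bit slice of every element (one circuit) or whole elements (all circuits), matching the two access patterns the abstract mentions.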
-
Patent number: 9575753
Abstract: Mechanisms, in a data processing system comprising a single instruction multiple data (SIMD) processor, for performing a data dependency check operation on vector element values of at least two input vector registers are provided. Two calls to a simd-check instruction are performed, one with input vector registers having a first order and one with the input vector registers having a different order. The simd-check instruction performs comparisons to determine if any data dependencies are present. Results of the two calls to the simd-check instruction are obtained and used to determine if any data dependencies are present in the at least two input vector registers. Based on the results, the SIMD processor may perform various operations.
Type: Grant
Filed: March 15, 2012
Date of Patent: February 21, 2017
Assignee: International Business Machines Corporation
Inventors: Alexandre E. Eichenberger, Bruce M. Fleischer
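A scalar model of such a check (a sketch, not the patented instruction) might flag a lane whenever an earlier lane of the other operand holds the same value, with the two calls made in opposite operand orders as the abstract describes:

```python
def simd_check(first, second):
    """Per lane of `first`, report whether any earlier lane of `second`
    holds the same value -- a simplified stand-in for a data dependency."""
    return [any(second[j] == first[i] for j in range(i))
            for i in range(len(first))]

va = [1, 2, 3, 4]
vb = [9, 1, 8, 7]
# Two calls, with the operand order swapped between them:
result_ab = simd_check(va, vb)
result_ba = simd_check(vb, va)
has_dependency = any(result_ab) or any(result_ba)
```

If both results come back clear, the loop iterations behind the two vectors can safely execute in SIMD fashion; otherwise the processor falls back to a safer schedule.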
-
Patent number: 9529571
Abstract: An apparatus and method for creation of reordered vectors from sequential input data for block based decimation, filtering, interpolation and matrix transposition using a memory circuit for a Single Instruction, Multiple Data (SIMD) Digital Signal Processor (DSP). This memory circuit includes a two-dimensional storage array, a rotate-and-distribute unit, a read controller and a write controller, to map input vectors containing sequential data elements in columns of the two-dimensional array and extract reordered target vectors from this array. The data elements and memory configuration are received from the SIMD DSP.
Type: Grant
Filed: October 5, 2011
Date of Patent: December 27, 2016
Assignee: TELEFONAKTIEBOLAGET LM ERICSSON (PUBL)
Inventors: David Van Kampen, Kees Van Berkel, Sven Goossens, Wim Kloosterhuis, Claudiu Zissulescu-Ianculescu
-
Patent number: 9507637
Abstract: Disclosed are apparatus and methods for managing thread resources. A computing device can generate threads for an executable application. The computing device can receive an allocation request to allocate thread-specific memory for an executable thread of the threads, where thread-specific memory includes a call stack for the executable thread. In response to the allocation request, the computing device can: allocate the thread-specific memory and indicate that the executable thread is ready for execution. The computing device can execute the executable thread. The computing device can receive a sleep request to suspend executable thread execution. In response to the sleep request, the computing device can determine whether the allocated thread-specific memory is to be deallocated. After determining that the allocated thread-specific memory is to be deallocated: the thread-specific memory can be deallocated and an indication that the executable thread execution is suspended can be provided.
Type: Grant
Filed: August 8, 2013
Date of Patent: November 29, 2016
Assignee: Google Inc.
Inventor: Winthrop Lyon Saville, III
-
Patent number: 9501449
Abstract: An apparatus, computer-readable medium, and computer-implemented method for parallelization of a computer program on a plurality of computing cores includes receiving a computer program comprising a plurality of commands, decomposing the plurality of commands into a plurality of node networks, each node network corresponding to a command in the plurality of commands and including one or more nodes corresponding to execution dependencies of the command, mapping the plurality of node networks to a plurality of systolic arrays, each systolic array comprising a plurality of cells and each non-data node in each node network being mapped to a cell in the plurality of cells, and mapping each cell in each systolic array to a computing core in the plurality of computing cores.
Type: Grant
Filed: September 10, 2014
Date of Patent: November 22, 2016
Assignee: Sviral, Inc.
Inventors: Solomon Harsha, Paul Master
-
Patent number: 9495160
Abstract: Method, apparatus, and program means for performing a string comparison operation. An apparatus includes execution resources to execute a first instruction. In response to the first instruction, said execution resources store a result of a comparison between each data element of a first and second operand corresponding to a first and second text string, respectively.
Type: Grant
Filed: December 5, 2014
Date of Patent: November 15, 2016
Assignee: Intel Corporation
Inventors: Michael A. Julier, Jeffrey D. Gray, Srinivas Chennupaty, Sean P. Mirkes, Mark P. Seconi
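The per-element comparison can be pictured with a small model (a sketch only; the actual instruction encodes its result as a mask or index rather than a list of booleans):

```python
def compare_strings(op1, op2):
    # One result flag per pair of corresponding characters,
    # like the element-wise comparison the abstract describes.
    return [c1 == c2 for c1, c2 in zip(op1, op2)]

mask = compare_strings("hello", "help!")
```

A single instruction producing such a mask lets software find matches, mismatches, or substring boundaries across many characters at once instead of looping per character.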
-
Patent number: 9489180
Abstract: Methods, apparatus and computer software product for source code optimization are provided. In an exemplary embodiment, a first custom computing apparatus is used to optimize the execution of source code on a second computing apparatus. In this embodiment, the first custom computing apparatus contains a memory, a storage medium and at least one processor with at least one multi-stage execution unit. The second computing apparatus contains at least one vector execution unit that allows for parallel execution of tasks on constant-strided memory locations. The first custom computing apparatus optimizes the code for parallelism, locality of operations, constant-strided memory accesses and vectorized execution on the second computing apparatus. This Abstract is provided for the sole purpose of complying with the Abstract requirement rules. This Abstract is submitted with the explicit understanding that it will not be used to interpret or to limit the scope or the meaning of the claims.
Type: Grant
Filed: November 16, 2012
Date of Patent: November 8, 2016
Assignee: Reservoir Labs, Inc.
Inventors: Muthu Baskaran, Richard A. Lethin, Benoit J. Meister, Nicolas T. Vasilache
-
Patent number: 9424034
Abstract: A processor includes N-bit registers and a decode unit to receive a multiple register memory access instruction. The multiple register memory access instruction is to indicate a memory location and a register. The processor includes a memory access unit coupled with the decode unit and with the N-bit registers. The memory access unit is to perform a multiple register memory access operation in response to the multiple register memory access instruction. The operation is to involve N-bit data in each of the N-bit registers comprising the indicated register. The operation is also to involve different corresponding N-bit portions of an M×N-bit line of memory corresponding to the indicated memory location. A total number of bits of the N-bit data in the N-bit registers to be involved in the multiple register memory access operation is to amount to at least half of the M×N-bits of the line of memory.
Type: Grant
Filed: June 28, 2013
Date of Patent: August 23, 2016
Assignee: Intel Corporation
Inventors: Glenn Hinton, Bret Toll, Ronak Singhal
-
Patent number: 9405539
Abstract: Methods, apparatus, instructions and logic provide SIMD vector sub-byte decompression functionality. Embodiments include shuffling a first and second byte into the least significant portion of a first vector element, and a third and fourth byte into the most significant portion. Processing continues shuffling a fifth and sixth byte into the least significant portion of a second vector element, and a seventh and eighth byte into the most significant portion. Then by shifting the first vector element by a first shift count and the second vector element by a second shift count, sub-byte elements are aligned to the least significant bits of their respective bytes. Processors then shuffle a byte from each of the shifted vector elements' least significant portions into byte positions of a destination vector element, and from each of the shifted vector elements' most significant portions into byte positions of another destination vector element.
Type: Grant
Filed: July 31, 2013
Date of Patent: August 2, 2016
Assignee: Intel Corporation
Inventors: Tal Uliel, Elmoustapha Ould-Ahmed-Vall, Thomas Willhalm, Robert Valentine
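The shuffle-and-shift pipeline above is hardware-specific, but the net effect — recovering sub-byte elements packed back-to-back in memory — can be sketched scalar-wise (assuming, for illustration, LSB-first packing):

```python
def unpack_sub_byte(data, width):
    """Extract `width`-bit elements packed contiguously (LSB first)
    in a byte string -- a scalar model of sub-byte decompression."""
    bits = 0
    nbits = 0
    out = []
    for byte in data:
        bits |= byte << nbits    # append the new byte above pending bits
        nbits += 8
        while nbits >= width:    # peel off complete width-bit elements
            out.append(bits & ((1 << width) - 1))
            bits >>= width
            nbits -= width
    return out

values = unpack_sub_byte(bytes([0b11011001]), 2)  # four 2-bit elements
```

The patented approach does the equivalent alignment for many elements at once, using byte shuffles to gather the right bytes and per-element shifts to bring each sub-byte field to a byte boundary.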
-
Patent number: 9367318
Abstract: Methods and systems are provided for managing thread execution in a processor. Multiple instructions are fetched from fetch queues. The instructions satisfy the condition that they involve fewer bits than the integer processing pathway that is used to execute them. The instructions are decoded, and divided into groups. The instructions are processed simultaneously through the pathway, such that part of the pathway is used to execute one group of instructions and another part of the pathway is used to execute another group of instructions. These parts are isolated from one another so the execution of the instructions can share the pathway and execute simultaneously and independently.
Type: Grant
Filed: November 3, 2015
Date of Patent: June 14, 2016
Assignee: Google Inc.
Inventor: James Laudon
-
Method of efficiently implementing a MPEG-4 AVC deblocking filter on an array of parallel processors
Patent number: 9369725
Abstract: A method for implementing a deblocking filter including the steps of (A) reading pixel values for a plurality of macroblocks of an unfiltered video frame from an input buffer into a working buffer, where the working buffer has dimensions determined by a predefined input region of the deblocking filter and a portion of the working buffer forms a filter output region of the deblocking filter, (B) sequentially processing the pixel values in the working buffer through a plurality of filter processing stages using an array of software-configurable general purpose parallel processors, where each of the plurality of filter processing stages operates on a respective set of the pixel values in the working buffer, and (C) writing filtered pixel values from the filter output region of the working buffer to an output buffer after the plurality of filter processing stages are completed.
Type: Grant
Filed: October 22, 2012
Date of Patent: June 14, 2016
Assignee: Amazon Technologies, Inc.
Inventor: Brian G. Lewis
-
Patent number: 9342334
Abstract: A system and method for simulating new instructions without compiler support for the new instructions. A simulator detects a given region in code generated by a compiler. The given region may be a candidate for vectorization or may be a region already vectorized. In response to the detection, the simulator suspends execution of a time-based simulation. The simulator then serially executes the region for at least two iterations using a functional-based simulation and using instructions with operands which correspond to P or less lanes of single-instruction-multiple-data (SIMD) execution. The value P is a maximum number of lanes of SIMD execution supported by the compiler. The simulator stores checkpoint state during the serial execution. In response to determining no inter-iteration memory dependencies exist, the simulator returns to the time-based simulation and resumes execution using N-wide vector instructions.
Type: Grant
Filed: June 22, 2012
Date of Patent: May 17, 2016
Assignee: Advanced Micro Devices, Inc.
Inventors: Bradford M. Beckmann, Nilay Vaish, Steven K. Reinhardt
-
Patent number: 9329870
Abstract: A method and circuit arrangement tightly couple together decode logic associated with multiple types of execution units and having varying priorities to enable instructions that are decoded as valid instructions for multiple types of execution units to be forwarded to a highest priority type of execution unit among the multiple types of execution units. Among other benefits, when an auxiliary execution unit is coupled to a general purpose processing core with the decode logic for the auxiliary execution unit tightly coupled with the decode logic for the general purpose processing core, the auxiliary execution unit may be used to effectively overlay new functionality for an existing instruction that is normally executed by the general purpose processing core, e.g., to patch a design flaw in the general purpose processing core or to provide improved performance for specialized applications.
Type: Grant
Filed: February 13, 2013
Date of Patent: May 3, 2016
Assignee: International Business Machines Corporation
Inventors: Adam J. Muff, Paul E. Schardt, Robert A. Shearer, Matthew R. Tubbs
-
Patent number: 9329671
Abstract: Computer system, method and computer program product for scheduling IPC activities are disclosed. In one embodiment, the computer system includes a first processor and a second processor that communicate with each other via IPC activities. The second processor may operate in a first mode in which the second processor is able to process IPC activities, or a second mode in which the second processor does not process IPC activities. Processing apparatus associated with the first processor identifies which of the pending IPC activities for communicating from the first processor to the second processor are not real-time sensitive, and schedules the identified IPC activities for communicating from the first processor to the second processor by delaying some of the identified IPC activities to thereby group them together. The grouped IPC activities are scheduled for communicating to the second processor during a period in which the second processor is continuously in the first mode.
Type: Grant
Filed: January 29, 2013
Date of Patent: May 3, 2016
Assignee: Nvidia Corporation
Inventors: Greg Heinrich, Philippe Guasch
-
Patent number: 9317296
Abstract: Methods, media, and computer systems are provided. The method includes, the media includes control logic for, and the computer system includes a processor with control logic for overriding an execution mask of SIMD hardware to enable at least one of a plurality of lanes of the SIMD hardware. Overriding the execution mask is responsive to a data parallel computation and a diverged control flow of a workgroup.
Type: Grant
Filed: December 21, 2012
Date of Patent: April 19, 2016
Assignee: ADVANCED MICRO DEVICES, INC.
Inventors: Timothy G. Rogers, Bradford M. Beckmann, James M. O'Connor
-
Patent number: 9311102
Abstract: Systems and methods to improve performance in a graphics processing unit are described herein. Embodiments achieve power saving in a graphics processing unit by dynamically activating/deactivating individual SIMDs in a shader complex that comprises multiple SIMD units. On-the-fly dynamic disabling and enabling of individual SIMDs provides flexibility in achieving a required performance and power level for a given processing application. Embodiments of the invention also achieve dynamic medium grain clock gating of SIMDs in a shader complex. Embodiments reduce switching power by shutting down clock trees to unused logic by providing a clock on demand mechanism. In this way, embodiments enhance clock gating to save more switching power for the duration of time when SIMDs are idle (or assigned no work). Embodiments can also save leakage power by power gating SIMDs for a duration when SIMDs are idle for an extended period of time.
Type: Grant
Filed: July 12, 2011
Date of Patent: April 12, 2016
Assignee: Advanced Micro Devices, Inc.
Inventors: Tushar K. Shah, Michael J. Mantor, Brian Emberling
-
Patent number: 9275984
Abstract: A multi-chip package system includes a signal transmission line commonly coupled to a plurality of semiconductor chips to transfer data to/from the semiconductor chips from/to outside; and a termination controller suitable for detecting a loading value of the signal transmission line and controlling a termination operation on the signal transmission line based on the loading value.
Type: Grant
Filed: July 5, 2013
Date of Patent: March 1, 2016
Assignee: SK Hynix Inc.
Inventor: Chun-Seok Jeong
-
Patent number: 9262704
Abstract: Methods and systems render higher bit per pixel contone images to lower bit formats using multiple registers of a SIMD processor. The rendering process uses a first register to maintain contone image values of all the pixels being simultaneously processed. A second register maintains a threshold value used during the conversion process. A third register maintains one value for the print ready format pixels (e.g., those having fewer bits per pixel), and a fourth register maintains the other value (e.g., 0) for the print ready format pixels. Also, a fifth register maintains the conversion error amount for all the pixels being simultaneously processed. Sixth through ninth registers maintain distributed conversion error amounts produced by the diffusing process (for different pixels being simultaneously processed); and a tenth register maintains the pixels in the print-ready format produced by the conversion for all the pixels being simultaneously processed.
Type: Grant
Filed: March 4, 2015
Date of Patent: February 16, 2016
Assignee: Xerox Corporation
Inventors: David Jon Metcalfe, Ryan David Metcalfe
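A minimal scalar model of the underlying conversion — thresholding with the quantization error carried into the next pixel — is sketched below. This is a simplified 1-D diffusion for illustration; the patent distributes the error across several registers and neighboring pixels, many pixels at a time:

```python
def threshold_with_error_diffusion(pixels, threshold=128):
    """Render 8-bit contone values to 1-bit output, diffusing each
    pixel's quantization error into the next pixel."""
    out = []
    error = 0
    for p in pixels:
        v = p + error                     # contone value plus carried error
        bit = 1 if v >= threshold else 0  # compare against the threshold
        out.append(bit)
        error = v - (255 if bit else 0)   # error left over after quantizing
    return out

bits = threshold_with_error_diffusion([200, 60, 120, 90])
```

The SIMD formulation in the abstract keeps the contone values, threshold, output levels, and error amounts each in their own vector register so the same per-pixel arithmetic runs across many pixels per instruction.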
-
Patent number: 9229721
Abstract: This disclosure is directed to techniques for executing subroutines in a single instruction, multiple data (SIMD) processing system that is subject to divergent thread conditions. In particular, a resume counter-based approach for managing divergent thread state is described that utilizes program module-specific minimum resume counters (MINRCs) for the efficient processing of control flow instructions. In some examples, the techniques of this disclosure may include using a main program MINRC to control the execution of a main program module and subroutine-specific MINRCs to control the execution of subroutine program modules. Techniques are also described for managing the main program MINRC and subroutine-specific MINRCs when subroutine call and return instructions are executed. Techniques are also described for updating a subroutine-specific MINRC to ensure that the updated MINRC value for the subroutine-specific MINRC is within the program space allocated for the subroutine.
Type: Grant
Filed: September 10, 2012
Date of Patent: January 5, 2016
Assignee: QUALCOMM Incorporated
Inventor: Lin Chen
-
Patent number: 9195675
Abstract: Embodiments provide methods and systems for encoding and decoding variable-length data, which may include methods for encoding and decoding search engine posting lists. Embodiments may include different encoding formats including group unary, packed unary, and/or packed binary formats. Some embodiments may utilize single instruction multiple data (SIMD) instructions that may perform a parallel shuffle operation on encoded data as part of the decoding processes. Some embodiments may utilize lookup tables to determine shuffle sequences and/or masks and/or shifts to be utilized in the decoding processes. Some embodiments may utilize hybrid formats.
Type: Grant
Filed: March 31, 2011
Date of Patent: November 24, 2015
Assignee: A9.com, Inc.
Inventors: Daniel E. Rose, Alexander A. Stepanov, Anil Ramesh Gangolli, Paramjit S. Oberoi, Ryan Jacob Ernst
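The patented group formats differ, but the family of variable-length integer encodings such decoders accelerate can be illustrated with a plain LEB128-style decoder (continuation flag in the high bit, seven payload bits per byte) — the SIMD versions replace this byte-at-a-time loop with table-driven shuffles:

```python
def decode_varint_stream(data):
    """Decode a stream of LEB128-style variable-length integers:
    each byte contributes 7 payload bits; a set high bit means
    more bytes follow for the current integer."""
    out, value, shift = [], 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift   # accumulate payload bits
        if byte & 0x80:                   # continuation: keep going
            shift += 7
        else:                             # final byte of this integer
            out.append(value)
            value, shift = 0, 0
    return out

nums = decode_varint_stream(bytes([0x05, 0xAC, 0x02]))
```

Because each encoded integer's length is data-dependent, SIMD decoding hinges on a lookup table keyed by the continuation bits that yields the shuffle pattern, masks, and shifts for a whole group at once.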
-
Patent number: 9189828
Abstract: An accelerator system is implemented on an expansion card comprising a printed circuit board having (a) one or more graphics processing units (GPUs), (b) two or more associated memory banks (logically or physically partitioned), (c) a specialized controller, and (d) a local bus providing signal coupling compatible with the PCI industry standards. The controller handles most of the primitive operations to set up and control GPU computation. Thus, the computer's central processing unit (CPU) can be dedicated to other tasks. In this case a few controls (simulation start and stop signals from the CPU and the simulation completion signal back to the CPU), GPU programs and input/output data are exchanged between the CPU and the expansion card. Moreover, since on every time step of the simulation the results from the previous time step are used but not changed, the results are preferably transferred back to the CPU in parallel with the computation.
Type: Grant
Filed: January 3, 2014
Date of Patent: November 17, 2015
Assignee: Neurala, Inc.
Inventors: Anatoli Gorchetchnikov, Heather Marie Ames, Massimiliano Versace, Fabrizio Santini
-
Patent number: 9183907
Abstract: One or more techniques for improving Vccmin for a dual port synchronous random access memory (DPSRAM) cell utilized as a single port synchronous random access memory (SPSRAM) cell are provided herein. In some embodiments, a second word line signal is sent to a second word line of the DPSRAM cell. For example, the second word line signal is sent in response to a logical low at a first bit line or a logical low at a second bit line. In this way, Vccmin is improved for the DPSRAM cell.
Type: Grant
Filed: November 28, 2012
Date of Patent: November 10, 2015
Assignee: Taiwan Semiconductor Manufacturing Company Limited
Inventors: Ching-Wei Wu, Cheng Hung Lee, Chia-Cheng Chen
-
Patent number: 9164770
Abstract: There is provided a method of performing single instruction multiple data (SIMD) operations. The method comprises storing a plurality of arrays in memory for performing SIMD operations thereon; determining a total number of SIMD operations to be performed on the plurality of arrays; loading a counter with the total number of SIMD operations to be performed on the plurality of arrays; enabling a plurality of arithmetic logic units (ALUs) to perform a first number of operations on first elements of the plurality of arrays; performing the first number of operations on first elements of the plurality of arrays using the plurality of ALUs; decrementing the counter by the first number of operations to provide a remaining number of operations; and enabling a number of the plurality of ALUs to perform the remaining number of operations on second elements of the plurality of arrays.
Type: Grant
Filed: November 23, 2009
Date of Patent: October 20, 2015
Assignee: Mindspeed Technologies, Inc.
Inventor: Patrick D. Ryan
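The counter-driven scheme above can be sketched in software (a model with a hypothetical 4-lane machine, not the claimed hardware): full-width passes run while the counter exceeds the lane count, and the final pass enables only as many ALUs as operations remain:

```python
def run_simd_ops(arrays, lanes=4):
    """Element-wise add of two arrays, driven by a counter of the
    total operations, enabling only `remaining` lanes on the last pass."""
    a, b = arrays
    total = len(a)                # counter loaded with total operations
    out = [0] * total
    i = 0
    remaining = total
    while remaining > 0:
        enabled = min(lanes, remaining)   # ALUs enabled for this pass
        for lane in range(enabled):
            out[i + lane] = a[i + lane] + b[i + lane]
        i += enabled
        remaining -= enabled              # decrement the counter
    return out

result = run_simd_ops(([1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60]))
```

Disabling the surplus ALUs on the tail pass avoids out-of-bounds work (and, in hardware, wasted power) when the array length is not a multiple of the SIMD width.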
-
Patent number: 9159276
Abstract: According to one embodiment of the present invention, a method for creating bit planes from frame data for a digital mirror device is disclosed including forming data elements comprising bits of equal significance from a plurality of pixel data in the frame data, the forming including using dual index direct memory access operations.
Type: Grant
Filed: December 20, 2007
Date of Patent: October 13, 2015
Assignee: TEXAS INSTRUMENTS INCORPORATED
Inventors: James N. Malina, Leonardo W. Estevez, Gunter Schmer
-
Patent number: 9147470
Abstract: Apparatus for programming a non-volatile memory, the apparatus having corresponding methods and tangible computer-readable media, comprise: a command memory configured to hold a plurality of command templates, wherein each of the command templates specifies a sequence of pad signals; a state machine configured to i) receive descriptors, wherein each of the descriptors includes a pointer to a respective one of the command templates in the command memory, and ii) generate the sequence of pad signals based on the command template indicated by the respective pointer; and a non-volatile memory interface configured to provide, to pads of the non-volatile memory, the sequence of pad signals generated by the state machine.
Type: Grant
Filed: November 12, 2012
Date of Patent: September 29, 2015
Assignee: MARVELL INTERNATIONAL LTD.
Inventors: Chih-Ching Chen, Hyunsuk Shin, Chi Kong Lee, Xueting Yu
-
Patent number: 9092227
Abstract: A vector slot processor that is capable of supporting multiple signal processing operations for multiple demodulation standards is provided. The vector slot processor includes a plurality of micro execution slots (MES) that perform the multiple signal processing operations on the high speed streaming inputs. Each of the MES includes one or more n-way signal registers that receive the high speed streaming inputs, one or more n-way coefficient registers that store filter coefficients for the multiple signal processing, and one or more n-way Multiply and Accumulate (MAC) units that receive the high speed streaming inputs from the one or more n-way signal registers and filter coefficients from one or more n-way coefficient registers. The one or more n-way MAC units perform a vertical MAC operation and a horizontal multiply and add operation on the high speed streaming inputs.
Type: Grant
Filed: May 2, 2012
Date of Patent: July 28, 2015
Inventors: Anindya Saha, Gururaj Padaki, Santosh Billava, Rakesh A. Joshi
-
Patent number: 9086872
Abstract: Receiving an instruction indicating first and second operands. Each of the operands having packed data elements that correspond in respective positions. A first subset of the data elements of the first operand and a first subset of the data elements of the second operand each corresponding to a first lane. A second subset of the data elements of the first operand and a second subset of the data elements of the second operand each corresponding to a second lane. Storing result, in response to instruction, including: (1) in first lane, only lowest order data elements from first subset of first operand interleaved with corresponding lowest order data elements from first subset of second operand; and (2) in second lane, only highest order data elements from second subset of first operand interleaved with corresponding highest order data elements from second subset of second operand.
Type: Grant
Filed: June 30, 2009
Date of Patent: July 21, 2015
Assignee: Intel Corporation
Inventors: Asaf Hargil, Doron Orenstein
-
Patent number: 9081562
Abstract: Receiving an instruction indicating first and second operands. Each of the operands having packed data elements that correspond in respective positions. A first subset of the data elements of the first operand and a first subset of the data elements of the second operand each corresponding to a first lane. A second subset of the data elements of the first operand and a second subset of the data elements of the second operand each corresponding to a second lane. Storing result, in response to instruction, including: (1) in first lane, only lowest order data elements from first subset of first operand interleaved with corresponding lowest order data elements from first subset of second operand; and (2) in second lane, only highest order data elements from second subset of first operand interleaved with corresponding highest order data elements from second subset of second operand.
Type: Grant
Filed: March 15, 2013
Date of Patent: July 14, 2015
Assignee: Intel Corporation
Inventors: Asaf Hargil, Doron Orenstein
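With hypothetical 8-element operands split into two 4-element lanes (index 0 taken as the lowest-order element), the interleave described in the abstract above can be modeled as a sketch:

```python
def unpack_interleave(a, b, lane_width=4):
    """Low result lane: interleave the low halves of each operand's low
    lane. High result lane: interleave the high halves of each operand's
    high lane. A simplified model of the described instruction."""
    half = lane_width // 2
    a_lo, a_hi = a[:lane_width], a[lane_width:]
    b_lo, b_hi = b[:lane_width], b[lane_width:]
    low_lane = [x for pair in zip(a_lo[:half], b_lo[:half]) for x in pair]
    high_lane = [x for pair in zip(a_hi[half:], b_hi[half:]) for x in pair]
    return low_lane + high_lane

res = unpack_interleave([0, 1, 2, 3, 4, 5, 6, 7],
                        [10, 11, 12, 13, 14, 15, 16, 17])
```

Per-lane interleaves like this let wide registers behave as several independent narrower registers, which keeps code written for the narrower width correct after widening.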
-
Publication number: 20150127924
Abstract: A method and corresponding apparatus for processing a shuffle instruction are provided. Shuffle units are configured in a hierarchical structure, and each of the shuffle units generates a shuffled data element array by performing shuffling on an input data element array. In the hierarchical structure, which includes an upper shuffle unit and a lower shuffle unit, the shuffled data element array output from the lower shuffle unit is input to the upper shuffle unit as a portion of the input data element array for the upper shuffle unit.
Type: Application
Filed: July 14, 2014
Publication date: May 7, 2015
Applicant: SAMSUNG ELECTRONICS CO., LTD.
Inventors: Keshava PRASAD, Navneet BASUTKAR, Young Hwan PARK, Ho YANG, Yeon Bok LEE
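A two-level sketch of that hierarchy (illustrative only; patterns and widths are hypothetical): the lower unit shuffles part of the data, and its output feeds the upper unit as a portion of the upper unit's input:

```python
def shuffle(data, pattern):
    """One shuffle unit: output[i] = input[pattern[i]]."""
    return [data[p] for p in pattern]

def hierarchical_shuffle(data, lower_pattern, upper_pattern):
    """The lower unit shuffles the first len(lower_pattern) elements;
    its output becomes part of the upper unit's input array."""
    lower_out = shuffle(data[:len(lower_pattern)], lower_pattern)
    upper_in = lower_out + data[len(lower_pattern):]
    return shuffle(upper_in, upper_pattern)

out = hierarchical_shuffle([1, 2, 3, 4], [1, 0], [2, 3, 0, 1])
```

Composing small shuffle units this way can realize wide permutations without a single full-width crossbar.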
-
Publication number: 20150100758
Abstract: A data processor includes a register file divided into at least a first portion and a second portion for storing data. A single instruction, multiple data (SIMD) unit is also divided into at least a first lane and a second lane. The first and second lanes of the SIMD unit correspond respectively to the first and second portions of the register file. Furthermore, each lane of the SIMD unit is capable of data processing. The data processor also includes a realignment element in communication with the register file and the SIMD unit. The realignment element is configured to selectively realign conveyance of data between the first portion of the register file and the first lane of the SIMD unit to the second lane of the SIMD unit.
Type: Application
Filed: October 3, 2013
Publication date: April 9, 2015
Applicant: ADVANCED MICRO DEVICES, INC.
Inventors: Timothy G. Rogers, Bradford M. Beckmann, James M. O'Connor
-
Patent number: 8996845
Abstract: A vector compare-and-exchange operation is performed by: decoding by a decoder in a processing device, a single instruction specifying a vector compare-and-exchange operation for a plurality of data elements between a first storage location, a second storage location, and a third storage location; issuing the single instruction for execution by an execution unit in the processing device; and responsive to the execution of the single instruction, comparing data elements from the first storage location to corresponding data elements in the second storage location; and responsive to determining a match exists, replacing the data elements from the first storage location with corresponding data elements from the third storage location.
Type: Grant
Filed: December 22, 2009
Date of Patent: March 31, 2015
Assignee: Intel Corporation
Inventors: Ravi Rajwar, Andrew T. Forsyth
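Element-wise, the operation reads as a vectorized compare-and-exchange, which a short model can capture (a sketch; the real instruction also operates atomically on its storage locations):

```python
def vector_compare_exchange(dest, expected, replacement):
    """Wherever `dest` matches `expected`, take the corresponding
    element from `replacement`; otherwise leave `dest` unchanged."""
    return [r if d == e else d
            for d, e, r in zip(dest, expected, replacement)]

out = vector_compare_exchange([1, 5, 3, 7],   # first storage location
                              [1, 2, 3, 4],   # second: expected values
                              [9, 9, 9, 9])   # third: replacement values
```

This generalizes the scalar compare-and-exchange primitive used in lock-free code to many data elements per instruction.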
-
Patent number: 8959319
Abstract: Embodiments of the present invention provide systems, methods, and computer program products for improving divergent conditional branches in code being executed by a processor. For example, in an embodiment, a method comprises detecting a conditional statement of a program being simultaneously executed by a plurality of threads, determining which threads evaluate a condition of the conditional statement as true and which threads evaluate the condition as false, pushing an identifier associated with the larger set of the threads onto a stack, executing code associated with a smaller set of the threads, and executing code associated with the larger set of the threads.
Type: Grant
Filed: December 2, 2011
Date of Patent: February 17, 2015
Assignee: Advanced Micro Devices, Inc.
Inventors: Mark Leather, Norman Rubin, Brian D. Emberling
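The scheduling decision in that abstract — split by the condition, stack the larger set, run the smaller set first — can be modeled in a few lines (a sketch of the ordering policy only, not the hardware reconvergence machinery):

```python
def schedule_divergent(threads, cond):
    """Split threads by a branch condition; the larger set is pushed
    on a stack and the smaller set's path executes first."""
    true_set = [t for t in threads if cond(t)]
    false_set = [t for t in threads if not cond(t)]
    smaller, larger = sorted([true_set, false_set], key=len)
    stack = [larger]          # identifier of the larger set is pushed
    first = smaller           # smaller set's path executes first
    second = stack.pop()      # then the larger set's path
    return first, second

first, second = schedule_divergent(range(8), lambda t: t % 4 == 0)
```

Running the smaller set first gets the cheap path out of the way, so the bulk of the threads resume full-width execution sooner.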
-
Patent number: 8954943
Abstract: A method for analyzing data reordering operations in Single Instruction Multiple Data (SIMD) source code and generating executable code therefrom is provided. Input is received. One or more data reordering operations in the input are identified and each data reordering operation in the input is abstracted into a corresponding virtual shuffle operation so that each virtual shuffle operation forms part of an expression tree. One or more virtual shuffle trees are collapsed by combining virtual shuffle operations within at least one of the one or more virtual shuffle trees to form one or more combined virtual shuffle operations, wherein each virtual shuffle tree is a subtree of the expression tree that only contains virtual shuffle operations. Then code is generated for the one or more combined virtual shuffle operations.
Type: Grant
Filed: January 26, 2006
Date of Patent: February 10, 2015
Assignee: International Business Machines Corporation
Inventors: Alexandre E. Eichenberger, Kai-Ting Amy Wang, Peng Wu, Peng Zhao
-
Publication number: 20150019838
Abstract: A method of loading and duplicating scalar data from a source into a destination register. The data may be duplicated in byte, half word, word, or double word parts, according to a duplication pattern.
Type: Application
Filed: July 9, 2014
Publication date: January 15, 2015
Inventors: Timothy David Anderson, Duc Quang Bui, Peter Richard Dent
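The duplication patterns this publication describes can be illustrated with a short Python sketch (the function name and the 16-byte register width are assumptions for illustration):

```python
def load_dup(scalar_bytes, part_size, reg_size=16):
    """Broadcast a scalar of part_size bytes (1 = byte, 2 = half word,
    4 = word, 8 = double word) across a reg_size-byte destination
    register by repeating it reg_size // part_size times."""
    assert len(scalar_bytes) == part_size and reg_size % part_size == 0
    return scalar_bytes * (reg_size // part_size)

# Duplicate a 4-byte word four times to fill a 16-byte register.
load_dup(b"\x01\x02\x03\x04", 4)
```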
-
Publication number: 20150012724
Abstract: A data processing apparatus has permutation circuitry for performing a permutation operation for changing a data element size or data element positioning of at least one source operand to generate first and second SIMD operands, and SIMD processing circuitry for performing a SIMD operation on the first and second SIMD operands. In response to a first SIMD instruction requiring a permutation operation, the instruction decoder controls the permutation circuitry to perform the permutation operation to generate the first and second SIMD operands and then controls the SIMD processing circuitry to perform the SIMD operation using these operands. In response to a second SIMD instruction not requiring a permutation operation, the instruction decoder controls the SIMD processing circuitry to perform the SIMD operation using the first and second SIMD operands identified by the instruction, without passing them via the permutation circuitry.
Type: Application
Filed: July 8, 2013
Publication date: January 8, 2015
Inventors: David Raymond Lutz, Neil Burgess
-
Patent number: 8918553
Abstract: A mechanism for programming a direct memory access engine operating as a multithreaded processor is provided. A plurality of programs is received from a host processor into a local memory associated with the direct memory access engine. A request is received in the direct memory access engine from the host processor indicating that the plurality of programs located in the local memory is to be executed. The direct memory access engine executes two or more of the plurality of programs without intervention by the host processor. As each of the two or more programs completes execution, the direct memory access engine sends a completion notification to the host processor indicating that the program has completed execution.
Type: Grant
Filed: June 5, 2012
Date of Patent: December 23, 2014
Assignee: International Business Machines Corporation
Inventors: Brian K. Flachs, Harm P. Hofstee, Charles R. Johns, Matthew E. King, John S. Liberty, Brad W. Michael
-
Patent number: 8914613
Abstract: In-lane vector shuffle operations are described. In one embodiment a shuffle instruction specifies a field of per-lane control bits, a source operand and a destination operand, these operands having corresponding lanes, each lane divided into corresponding portions of multiple data elements. Sets of data elements are selected from corresponding portions of every lane of the source operand according to per-lane control bits. Elements of these sets are copied to specified fields in corresponding portions of every lane of the destination operand. Another embodiment of the shuffle instruction also specifies a second source operand, all operands having corresponding lanes divided into multiple data elements. A set selected according to per-lane control bits contains data elements from every lane portion of a first source operand and data elements from every corresponding lane portion of the second source operand. Set elements are copied to specified fields in every lane of the destination operand.
Type: Grant
Filed: August 26, 2011
Date of Patent: December 16, 2014
Assignee: Intel Corporation
Inventors: Zeev Sperber, Robert Valentine, Benny Eitan, Doron Orenstein
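The "in-lane" constraint means every destination element is selected from the same lane of the source, so lanes never exchange data. A hedged Python sketch with illustrative names:

```python
def in_lane_shuffle(src, control, lane_elems=4):
    """Per-lane shuffle: split src into lanes of lane_elems elements;
    destination slot i of each lane receives the element selected by
    control[i] from the SAME lane of the source."""
    assert len(src) % lane_elems == 0 and len(control) == lane_elems
    out = []
    for base in range(0, len(src), lane_elems):
        lane = src[base:base + lane_elems]
        out.extend(lane[c] for c in control)
    return out

# Reverse each 4-element lane of a two-lane vector.
in_lane_shuffle([0, 1, 2, 3, 4, 5, 6, 7], [3, 2, 1, 0])  # → [3, 2, 1, 0, 7, 6, 5, 4]
```

Keeping selection within a lane is what lets hardware implement such shuffles without cross-lane wiring.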
-
Patent number: 8898432
Abstract: Systems and methods for folding a single instruction multiple data (SIMD) array include a newly defined processing element group (PEG) that allows interconnection of PEGs by abutment without requiring a row or column weave pattern. The interconnected PEGs form a SIMD array that is effectively folded at its center along the North-South axis, and may also be folded along the East-West axis. The folding of the array provides for north and south boundaries to be co-located and for east and west boundaries to be co-located. The co-location allows wrap-around connections to be done with a propagation distance reduced effectively to zero.
Type: Grant
Filed: October 25, 2011
Date of Patent: November 25, 2014
Assignee: Geo Semiconductor, Inc.
Inventor: Woodrow L. Meeker
-
Patent number: 8892781
Abstract: A computer program product, apparatus, and a method for facilitating input/output (I/O) processing for an I/O operation at a host computer system configured for communication with a control unit. The method includes receiving a command block from the channel subsystem, the command block including at least one input command and at least one output command specified by a transport command word (TCW) and associated with the I/O operation, the I/O operation having both input and output data, the TCW specifying a location in the memory of the output data and a location in the memory for storing the input data; receiving the output data specified by the TCW and executing the at least one output command; and forwarding the input data specified by the TCW to the channel subsystem for storage at a location specified by the TCW.
Type: Grant
Filed: June 13, 2013
Date of Patent: November 18, 2014
Assignee: International Business Machines Corporation
Inventors: John R. Flanagan, Daniel F. Casper, Catherine C. Huang, Matthew J. Kalos, Ugochukwu C. Njoku, Dale F. Riedy, Gustav E. Sittmann, III
-
Patent number: 8862827
Abstract: A cache manager receives a request for data, which includes a requested effective address. The cache manager determines whether the requested effective address matches a most recently used effective address stored in a mapped tag vector. When the most recently used effective address matches the requested effective address, the cache manager identifies a corresponding cache location and retrieves the data from the identified cache location. However, when the most recently used effective address fails to match the requested effective address, the cache manager determines whether the requested effective address matches a subsequent effective address stored in the mapped tag vector. When the cache manager determines a match to a subsequent effective address, the cache manager identifies a different cache location corresponding to the subsequent effective address and retrieves the data from the different cache location.
Type: Grant
Filed: December 29, 2009
Date of Patent: October 14, 2014
Assignee: International Business Machines Corporation
Inventors: Brian Flachs, Barry L. Minor, Mark Richard Nutter
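The MRU-first lookup can be sketched in Python; the tag-vector representation here (a list of (effective address, cache location) pairs, most recently used first) is an assumption for illustration, not the patent's data layout:

```python
def lookup(cache, tag_vector, requested_ea):
    """Check the most recently used tag entry first, then fall back to
    scanning the subsequent entries of the mapped tag vector."""
    if tag_vector and tag_vector[0][0] == requested_ea:   # MRU hit
        return cache[tag_vector[0][1]]
    for ea, loc in tag_vector[1:]:                        # subsequent entries
        if ea == requested_ea:
            return cache[loc]
    return None                                           # miss
```

Checking the MRU entry first pays off when accesses exhibit temporal locality, since the common case resolves after a single comparison.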
-
Patent number: 8856494
Abstract: Data processing circuit containing an instruction execution circuit having an instruction set comprising a SIMD instruction. The instruction execution circuit comprises arithmetic circuits, arranged to perform N respective identical operations in parallel in response to the SIMD instruction. The SIMD instruction selects a first one and a second one of the registers. The SIMD instruction defines a first and second series of N respective SIMD instruction operands of the SIMD instruction from the addressed registers. Each arithmetic circuit receives a respective first operand and a respective second operand from the first and second series respectively. The instruction execution circuit selects the first and second series so they partially overlap. Positioning the operands is under program control.
Type: Grant
Filed: January 11, 2012
Date of Patent: October 7, 2014
Assignee: Intel Corporation
Inventor: Antonius A. M. Van Wel
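A Python sketch of the overlapping-series idea (illustrative names; the register file is modeled as a flat list and the series positions are program-supplied, as the abstract describes):

```python
def overlapping_simd(registers, start1, start2, n, op):
    """Run n identical operations in parallel on two operand series
    drawn from the same register file; the series may partially
    overlap, with their positions under program control."""
    first = registers[start1:start1 + n]
    second = registers[start2:start2 + n]
    return [op(a, b) for a, b in zip(first, second)]

# Series [1, 2, 3, 4] and [2, 3, 4, 5] overlap in three elements.
overlapping_simd([1, 2, 3, 4, 5], 0, 1, 4, lambda a, b: a + b)  # → [3, 5, 7, 9]
```

Overlapping series are useful for sliding-window computations such as filters, where adjacent outputs reuse most of their inputs.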
-
Patent number: 8838946
Abstract: An apparatus includes an instruction decoder, first and second source registers and a circuit coupled to the decoder to receive packed data from the source registers and to unpack the packed data responsive to an unpack instruction received by the decoder. A first packed data element and a third packed data element are received from the first source register. A second packed data element and a fourth packed data element are received from the second source register. The circuit copies the packed data elements into a destination register resulting with the second packed data element adjacent to the first packed data element, the third packed data element adjacent to the second packed data element, and the fourth packed data element adjacent to the third packed data element.
Type: Grant
Filed: December 29, 2012
Date of Patent: September 16, 2014
Assignee: Intel Corporation
Inventors: Alexander Peleg, Yaakov Yaari, Millind Mittal, Larry M. Mennemeier, Benny Eitan
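The element order described (odd-numbered elements from the first source, even-numbered from the second, each adjacent to its counterpart) is a classic interleave, which a short Python sketch makes concrete (names are illustrative):

```python
def unpack(src1, src2):
    """Interleave corresponding elements of two source registers so
    each element of src2 lands adjacent to its counterpart from src1:
    [src1[0], src2[0], src1[1], src2[1], ...]."""
    out = []
    for a, b in zip(src1, src2):
        out.extend((a, b))
    return out

unpack([1, 3, 5, 7], [2, 4, 6, 8])  # → [1, 2, 3, 4, 5, 6, 7, 8]
```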
-
Patent number: 8832417
Abstract: This disclosure describes techniques for handling divergent thread conditions in a multi-threaded processing system. In some examples, a control flow unit may obtain a control flow instruction identified by a program counter value stored in a program counter register. The control flow instruction may include a target value indicative of a target program counter value for the control flow instruction. The control flow unit may select one of the target program counter value and a minimum resume counter value as a value to load into the program counter register. The minimum resume counter value may be indicative of a smallest resume counter value from a set of one or more resume counter values associated with one or more inactive threads. Each of the one or more resume counter values may be indicative of a program counter value at which a respective inactive thread should be activated.
Type: Grant
Filed: September 7, 2011
Date of Patent: September 9, 2014
Assignee: QUALCOMM Incorporated
Inventors: Lin Chen, David Rigel Garcia Garcia, Andrew E. Gruber, Guofang Jiao
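The abstract leaves the selection criterion open; one plausible rule, sketched here under the assumption that the smaller program counter value wins so the earliest-resuming inactive thread is not skipped:

```python
def next_pc(target_pc, resume_counters):
    """Select the value loaded into the program counter register: the
    branch target, or the minimum resume counter of the inactive
    threads, whichever is smaller (assumed selection rule)."""
    if resume_counters:
        return min(target_pc, min(resume_counters))
    return target_pc

next_pc(100, [50, 120])  # → 50: an inactive thread resumes before the branch target
```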
-
Patent number: 8825987
Abstract: Methods, apparatus, and instructions for performing string comparison operations. An apparatus may include execution resources to execute a first instruction. In response to the first instruction, said execution resources store a result of a comparison between each data element of a first and second operand corresponding to a first and second text string, respectively.
Type: Grant
Filed: December 20, 2012
Date of Patent: September 2, 2014
Assignee: Intel Corporation
Inventors: Michael A. Julier, Jeffrey D. Gray, Srinivas Chennupaty, Sean P. Mirkes, Mark P. Seconi
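The per-element result the abstract describes can be sketched in Python (illustrative only; a real string-compare instruction evaluates all packed characters in a single operation):

```python
def compare_strings(op1, op2):
    """Per-element comparison of two text-string operands: result[i]
    is True where the corresponding data elements (characters) match."""
    return [a == b for a, b in zip(op1, op2)]

compare_strings("abcd", "abed")  # → [True, True, False, True]
```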
-
Patent number: 8819394
Abstract: Methods, apparatus, and instructions for performing string comparison operations. In one embodiment, an apparatus includes execution resources to execute a first instruction. In response to the first instruction, said execution resources store a result of a comparison between each data element of a first and second operand corresponding to a first and second text string, respectively.
Type: Grant
Filed: December 20, 2012
Date of Patent: August 26, 2014
Assignee: Intel Corporation
Inventors: Michael A. Julier, Jeffrey D. Gray, Srinivas Chennupaty, Sean P. Mirkes, Mark P. Seconi
-
Publication number: 20140208068
Abstract: Compression and decompression of numerical data utilizing single instruction, multiple data (SIMD) instructions is described. The numerical data includes integer and floating-point samples. Compression supports three encoding modes: lossless, fixed-rate, and fixed-quality. SIMD instructions for compression operations may include attenuation, derivative calculations, bit packing to form compressed packets, header generation for the packets, and packed array output operations. SIMD instructions for decompression may include packed array input operations, header recovery, decoder control, bit unpacking, integration, and amplification. Compression and decompression may be implemented in a microprocessor, digital signal processor, field-programmable gate array, application-specific integrated circuit, system-on-chip, or graphics processor, using SIMD instructions. Compression and decompression of numerical data can reduce memory, networking, and storage bottlenecks.
Type: Application
Filed: January 22, 2013
Publication date: July 24, 2014
Applicant: Samplify Systems, Inc.
Inventor: Albert W. Wegener
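Two of the compression stages named in the abstract, attenuation and derivative calculation, can be sketched on integer samples. This is a hedged illustration, not Samplify's encoder; the shift amount and sample values are arbitrary:

```python
def compress_samples(samples, atten_shift):
    """Attenuate (arithmetic right shift) and then take the first
    derivative, shrinking the magnitude of slowly varying samples so
    the subsequent bit-packing stage needs fewer bits per sample."""
    att = [s >> atten_shift for s in samples]
    return [att[0]] + [att[i] - att[i - 1] for i in range(1, len(att))]

compress_samples([8, 12, 16, 16], 2)  # → [2, 1, 1, 0]
```

Decompression reverses the stages: integration (cumulative sum) undoes the derivative, and amplification undoes the attenuation.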
-
Publication number: 20140208069
Abstract: An execution unit configured for compression and decompression of numerical data utilizing single instruction, multiple data (SIMD) instructions is described. The numerical data includes integer and floating-point samples. Compression supports three encoding modes: lossless, fixed-rate, and fixed-quality. SIMD instructions for compression operations may include attenuation, derivative calculations, bit packing to form compressed packets, header generation for the packets, and packed array output operations. SIMD instructions for decompression may include packed array input operations, header recovery, decoder control, bit unpacking, integration, and amplification. Compression and decompression may be implemented in a microprocessor, digital signal processor, field-programmable gate array, application-specific integrated circuit, system-on-chip, or graphics processor, using SIMD instructions. Compression and decompression of numerical data can reduce memory, networking, and storage bottlenecks.
Type: Application
Filed: January 22, 2013
Publication date: July 24, 2014
Applicant: Samplify Systems, Inc.
Inventor: Albert W. Wegener
-
Publication number: 20140201498
Abstract: Instructions and logic provide vector scatter-op and/or gather-op functionality. In some embodiments, responsive to an instruction specifying: a gather and a second operation, a destination register, an operand register, and a memory address; execution units read values in a mask register, wherein fields in the mask register correspond to offset indices in the indices register for data elements in memory. A first mask value indicates the element has not been gathered from memory and a second value indicates that the element does not need to be, or has already been, gathered. For each field having the first value, the data element is gathered from memory into the corresponding destination register location, and the corresponding value in the mask register is changed to the second value. When all mask register fields have the second value, the second operation is performed using corresponding data in the destination and operand registers to generate results.
Type: Application
Filed: September 26, 2011
Publication date: July 17, 2014
Applicant: Intel Corporation
Inventors: Elmoustapha Ould-Ahmed-Vall, Kshitij A. Doshi, Charles R. Yount, Suleyman Sair
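The mask-driven gather and its restartability can be simulated in Python (names are illustrative; memory is modeled as a flat list):

```python
def masked_gather(memory, base, indices, mask, dest):
    """For each mask field still holding the 'not yet gathered' value
    (True), load memory[base + index] into dest and flip the field to
    'done' (False). Because completed elements are marked in the mask,
    a gather interrupted by a fault can simply be re-issued and will
    only fetch the elements still pending."""
    for i, pending in enumerate(mask):
        if pending:
            dest[i] = memory[base + indices[i]]
            mask[i] = False
    return dest, mask
```

Once every mask field reads 'done', the second operation of the gather-op can proceed on the destination and operand registers.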
-
Publication number: 20140195778
Abstract: Instructions and logic provide vector load-op and/or store-op with stride functionality. In some embodiments, responsive to an instruction specifying: a set of loads, a second operation, a destination register, an operand register, a memory address, and a stride length; execution units read values in a mask register, wherein fields in the mask register correspond to stride-length multiples from the memory address to data elements in memory. A first mask value indicates the element has not been loaded from memory and a second value indicates that the element does not need to be, or has already been, loaded. For each field having the first value, the data element is loaded from memory into the corresponding destination register location, and the corresponding value in the mask register is changed to the second value. Then the second operation is performed using corresponding data in the destination and operand registers to generate results. The instruction may be restarted after faults.
Type: Application
Filed: September 26, 2011
Publication date: July 10, 2014
Applicant: Intel Corporation
Inventors: Elmoustapha Ould-Ahmed-Vall, Kshitij A. Doshi, Suleyman Sair, Charles R. Yount
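The load-with-stride plus second-operation flow can be simulated in Python (illustrative names; memory is modeled as a flat list and the second operation as a callable):

```python
def strided_load_op(memory, addr, stride, mask, dest, operand, op):
    """Load each pending element (mask True) from addr + i * stride and
    mark it done; once nothing is pending, apply the second operation
    to the destination and operand registers to generate results."""
    for i, pending in enumerate(mask):
        if pending:
            dest[i] = memory[addr + i * stride]
            mask[i] = False
    if not any(mask):
        return [op(d, o) for d, o in zip(dest, operand)]
    return dest  # a fault left elements pending; restart the instruction

strided_load_op(list(range(32)), 0, 4, [True] * 4, [0] * 4, [1] * 4,
                lambda a, b: a + b)  # → [1, 5, 9, 13]
```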
-
Patent number: 8775147
Abstract: An algorithm and architecture are disclosed for performing multi-argument associative operations. The algorithm and architecture can be used to schedule operations on multiple facilities for computations or can be used in the development of a model in a modeling environment. The algorithm, and the architecture that results from it, use the latency of the components that process the associative operations. The algorithm minimizes the number of components necessary to produce an output of multi-argument associative operations and can also minimize the number of inputs each component receives.
Type: Grant
Filed: May 31, 2006
Date of Patent: July 8, 2014
Assignee: The MathWorks, Inc.
Inventors: Alireza Pakyari, Brian K. Ogilvie