Vector Processor Operation Patents (Class 712/7)
  • Publication number: 20130024656
    Abstract: Embodiments of a system and a method in which a processor may execute instructions that cause the processor to receive an input vector and a control vector are disclosed. The executed instructions may also cause the processor to perform a Boolean operation on another input vector dependent upon the input vector and the control vector.
    Type: Application
    Filed: September 27, 2012
    Publication date: January 24, 2013
    Applicant: APPLE INC.
    Inventor: Apple Inc.
  • Publication number: 20130024654
    Abstract: A processor, method, and medium for using vector operations to compress selected elements of a vector. An input vector is compared to a criteria vector, and then a subset of the plurality of elements of the input vector are selected based on the comparison. A permutation vector is generated based on the locations of the selected elements and then the permutation vector is used to permute the selected elements of the input vector to an output vector. The selected elements of the input vector are stored in contiguous locations in the leftmost elements of the output vector. Then, the output vector is stored to memory and a pointer to the memory location is incremented by the number of selected elements.
    Type: Application
    Filed: July 20, 2011
    Publication date: January 24, 2013
    Inventor: Darryl J. Gove
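    Illustration (not part of the patent record): the compress-and-store sequence in the abstract above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function name `compress_store`, the greater-than comparison, the zero fill of unselected slots, and the list used as "memory" are all hypothetical choices made for the example, not details taken from the application.

```python
def compress_store(input_vec, criteria_vec, memory, ptr):
    """Pack selected elements leftmost, store them, and return the advanced pointer.

    Assumption: an element is "selected" when it is greater than the
    corresponding criteria element. The abstract does not name the comparison.
    """
    # Compare the input vector to the criteria vector, element by element.
    selected = [x > c for x, c in zip(input_vec, criteria_vec)]
    # Permutation vector: source positions of the selected elements, in order.
    perm = [i for i, keep in enumerate(selected) if keep]
    count = len(perm)
    # Permute selected elements into the leftmost slots; pad the rest with 0.
    output = [input_vec[i] for i in perm] + [0] * (len(input_vec) - count)
    # Store the whole output vector, then advance only by the selected count,
    # so the next store overwrites the padding.
    for offset, value in enumerate(output):
        memory[ptr + offset] = value
    return ptr + count


memory = [0] * 12
ptr = 0
ptr = compress_store([5, 1, 7, 2], [3, 3, 3, 3], memory, ptr)
ptr = compress_store([4, 9, 0, 8], [3, 3, 3, 3], memory, ptr)
print(memory[:ptr])  # [5, 7, 4, 9, 8]
```

    Advancing the pointer by the number of selected elements, rather than by the vector length, is what keeps the stored results contiguous across successive calls.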
  • Patent number: 8356160
    Abstract: Embodiments of the invention provide methods and apparatus for executing a multiple operand minimum or maximum instruction. Executing the multiple operand minimum or maximum instruction comprises transferring more than two operands to one or more processing lanes of a vector unit. A first compare operation may be performed in at least one processing lane of the vector unit to determine a greater or smaller of a first operand and a second operand. The greater (or smaller) operand may be transferred to a dot product unit, wherein, in a second compare operation, the transferred operand is compared to at least a third operand to determine one of the greater and smaller of the more than two operands.
    Type: Grant
    Filed: January 15, 2008
    Date of Patent: January 15, 2013
    Assignee: International Business Machines Corporation
    Inventors: Adam J. Muff, Matthew R. Tubbs
  • Patent number: 8316215
    Abstract: It is an object to speed up a vector store instruction on a memory that is divided into banks as setting a plurality of elements as a unit while minimizing an increase in physical quantity. A vector processing apparatus has a plurality of register banks and processes a data string including a plurality of data elements retained in the plurality of register banks, wherein: the plurality of register banks each have a read pointer 113 that points to a read position for reading the data elements; and the start position of the read pointer 113 is changed from one register bank to another. For example, consecutive numbers assigned to the register banks may be used as the read start positions of the respective register banks.
    Type: Grant
    Filed: March 7, 2008
    Date of Patent: November 20, 2012
    Assignee: NEC Corporation
    Inventor: Noritaka Hoshi
  • Patent number: 8316216
    Abstract: In one embodiment, the present invention includes an apparatus having a register file to store vector data, an address generator coupled to the register file to generate addresses for a vector memory operation, and a controller to generate an output slice from one or more slices each including multiple addresses, where the output slice includes addresses each corresponding to a separately addressable portion of a memory. Other embodiments are described and claimed.
    Type: Grant
    Filed: October 21, 2009
    Date of Patent: November 20, 2012
    Assignee: Intel Corporation
    Inventors: Roger Espasa, Joel Emer, Geoff Lowney, Roger Gramunt, Santiago Galan, Toni Juan, Jesus Corbal, Federico Ardanaz, Isaac Hernandez
  • Patent number: 8312442
    Abstract: A computing system has an amount of shared cache, and performs runtime automatic parallelization wherein when a parallelized loop is encountered, a main thread shares the workload with at least one other non-main thread. A method for providing interprocedural prefetching includes compiling source code to produce compiled code having a main thread including a parallelized loop. Prior to the parallelized loop in the main thread, the main thread includes prefetching instructions for the at least one other non-main thread that shares the workload of the parallelized loop. As a result, the main thread prefetches data into the shared cache for use by the at least one other non-main thread.
    Type: Grant
    Filed: December 10, 2008
    Date of Patent: November 13, 2012
    Assignee: Oracle America, Inc.
    Inventors: Yonghong Song, Spiros Kalogeropulos, Partha P. Tirumalai
  • Publication number: 20120284487
    Abstract: A vector slot processor that is capable of supporting multiple signal processing operations for multiple demodulation standards is provided. The vector slot processor includes a plurality of micro execution slots (MES) that perform the multiple signal processing operations on the high speed streaming inputs. Each of the MES includes one or more n-way signal registers that receive the high speed streaming inputs, one or more n-way coefficient registers that store filter coefficients for the multiple signal processing, and one or more n-way Multiply and Accumulate (MAC) units that receive the high speed streaming inputs from the one or more n-way signal registers and filter coefficients from one or more n-way coefficient registers. The one or more n-way MAC units perform a vertical MAC operation and a horizontal multiply and add operation on the high speed streaming inputs.
    Type: Application
    Filed: May 2, 2012
    Publication date: November 8, 2012
    Applicant: Saankhya Labs Private Limited
    Inventors: Anindya SAHA, Gururaj PADAKI, Santosh BILLAVA, Rakesh A. JOSHI
  • Patent number: 8296548
    Abstract: A method for locating an extreme value data chunk within a data block, the method includes: fetching, by a processor, an instruction; fetching, in response to a content of the instruction, a data unit that comprises multiple data chunks; selectively masking the fetched data chunks in response to a value of a mask; comparing, by a hardware accelerator, between values of valid data chunks to provide an extreme value data chunk; wherein valid data chunks include un-masked data chunks that belong to the data block; updating the value of the mask and jumping to the stage of fetching a new data unit, until the whole data block is fetched.
    Type: Grant
    Filed: January 18, 2006
    Date of Patent: October 23, 2012
    Assignee: Freescale Semiconductor, Inc.
    Inventors: Moti Dvir, Evgeni Ginzburg, Adi Katz
  • Patent number: 8264391
    Abstract: A signal converting system has a multi-segment digital to analog converter coupled to an error shaping loop. A control value is received at a vector processor that indicates a number N of elements that are to be selected from a vector having M elements. The elements of the vector are sorted into a bitonic sequence and separated into a larger value group and a smaller value group using a bitonic split. Only the larger value group is sorted into an ordered sequence with repeated bitonic splits when the control value is less than M/2, and N largest elements are selected from the ordered sequence. Only the smaller value group is sorted into an ordered sequence with repeated bitonic splits when the control value is greater than M/2, and N−M/2 largest elements are selected from the ordered sequence.
    Type: Grant
    Filed: October 12, 2010
    Date of Patent: September 11, 2012
    Assignee: Texas Instruments Incorporated
    Inventor: Yanto Suryono
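    Illustration (not part of the patent record): the bitonic-split selection described in the abstract above can be sketched as follows. This is a minimal sketch, assuming M is a power of two; the function names are hypothetical, Python's `sorted` stands in for whatever network builds the initial bitonic sequence, and treating N = M/2 like the "less than" case is an assumption the abstract does not cover.

```python
def bitonic_split(seq):
    """Split a bitonic sequence into (smaller group, larger group)."""
    half = len(seq) // 2
    smaller = [min(seq[i], seq[i + half]) for i in range(half)]
    larger = [max(seq[i], seq[i + half]) for i in range(half)]
    return smaller, larger


def bitonic_merge(seq):
    """Sort a bitonic sequence ascending using repeated bitonic splits."""
    if len(seq) <= 1:
        return list(seq)
    smaller, larger = bitonic_split(seq)
    return bitonic_merge(smaller) + bitonic_merge(larger)


def select_largest(vec, n):
    """Return the n largest elements of vec (length M, a power of two)."""
    half = len(vec) // 2
    # Build a bitonic sequence: ascending first half, descending second half.
    bitonic = sorted(vec[:half]) + sorted(vec[half:], reverse=True)
    smaller, larger = bitonic_split(bitonic)
    if n <= half:
        # Only the larger group is sorted; take its n largest elements.
        return bitonic_merge(larger)[half - n:]
    # Only the smaller group is sorted; take its n - M/2 largest elements
    # and keep the whole (unsorted) larger group.
    extra = n - half
    return bitonic_merge(smaller)[half - extra:] + larger


values = [3, 7, 1, 8, 2, 9, 4, 6]
print(sorted(select_largest(values, 3)))  # [7, 8, 9]
print(sorted(select_largest(values, 6)))  # [3, 4, 6, 7, 8, 9]
```

    Sorting only one of the two groups is the point of the scheme: whichever side of M/2 the control value falls on, half of the vector never needs a full sort.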
  • Publication number: 20120221830
    Abstract: A processor core comprises one or more vector units operable to change between a fine-grained vector mode having a shorter maximum vector length and a coarse-grained vector mode having a longer maximum vector length. Changing vector modes comprises halting all instruction stream execution in the core, flushing one or more registers in a register space, reconfiguring one or more vector registers in the register space, and restarting instruction execution in the core.
    Type: Application
    Filed: February 29, 2012
    Publication date: August 30, 2012
    Applicant: CRAY INC.
    Inventors: Gregory J. Faanes, Eric P. Lundberg, Abdulla Bataineh, Timothy J. Johnson, Michael Parker, James Robert Kohn, Steven L. Scott, Robert Alverson
  • Publication number: 20120216011
    Abstract: An apparatus, method, and medium for performing a vector operation on portions of one or more source vector registers. A vector unit performs an operation on the source vector registers and only stores results in the target vector register for elements which are selected by the vector operation mask. The vector operation mask can be read by the vector unit or loaded into the vector unit for each instruction cycle. The vector operation mask allows the vector unit to be used with partially filled source vector registers and eliminates the need for scalar operations to be performed on vector data.
    Type: Application
    Filed: February 18, 2011
    Publication date: August 23, 2012
    Inventors: Darryl Gove, David Weaver
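    Illustration (not part of the patent record): the masked write-back described in the abstract above can be sketched as below. This is a minimal sketch; the function name `masked_add`, the choice of addition as the vector operation, and the Boolean-list mask are assumptions made for the example.

```python
def masked_add(src_a, src_b, target, mask):
    """Add src_a and src_b, writing results only where the mask is set.

    Unselected target elements keep their previous values, which lets a
    partially filled source register be processed without scalar fix-up code.
    """
    return [a + b if m else t
            for a, b, t, m in zip(src_a, src_b, target, mask)]


target = [0, 0, 0, 0]
# Only the first three lanes hold valid data, so the last lane is masked off.
result = masked_add([1, 2, 3, 99], [10, 20, 30, 99], target,
                    [True, True, True, False])
print(result)  # [11, 22, 33, 0]
```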
  • Publication number: 20120192005
    Abstract: The described embodiments provide a processor that executes vector instructions. In the described embodiments, the processor initializes an architectural fault-status register (FSR) and a shadow copy of the architectural FSR by setting each of N bit positions in the architectural FSR and the shadow copy of the architectural FSR to a first predetermined value. The processor then executes a first first-faulting or non-faulting (FF/NF) vector instruction. While executing the first vector instruction, the processor also executes one or more subsequent FF/NF instructions. In these embodiments, when executing the first vector instruction and the subsequent vector instructions, the processor updates one or more bit positions in the shadow copy of the architectural FSR to a second predetermined value upon encountering a fault condition.
    Type: Application
    Filed: April 20, 2011
    Publication date: July 26, 2012
    Applicant: APPLE INC.
    Inventor: Jeffry E. Gonion
  • Publication number: 20120166761
    Abstract: A processing core implemented on a semiconductor chip is described having first execution unit logic circuitry that includes first comparison circuitry to compare each element in a first input vector against every element of a second input vector. The processing core also has second execution logic circuitry that includes second comparison circuitry to compare a first input value against every data element of an input vector.
    Type: Application
    Filed: December 22, 2010
    Publication date: June 28, 2012
    Inventors: Christopher J. Hughes, Mark J. Charney, Yen-Kuang Chen, Jesus Corbal, Andrew T. Forsyth, Milind B. Girkar, Jonathan C. Hall, Hideki Ido, Robert Valentine, Jeffrey Wiedemeier
  • Patent number: 8209525
    Abstract: The described embodiments provide a system that executes program code. While executing program code, the processor encounters at least one vector instruction and at least one vector-control instruction. The vector instruction includes a set of elements, wherein each element is used to perform an operation for a corresponding iteration of a loop in the program code. The vector-control instruction identifies elements in the vector instruction that may be operated on in parallel without causing an error due to a runtime data dependency between the iterations of the loop. The processor then executes the loop by repeatedly executing the vector-control instruction to identify a next group of elements that can be operated on in the vector instruction and selectively executing the vector instruction to perform the operation for the next group of elements in the vector instruction, until the operation has been performed for all elements of the vector instruction.
    Type: Grant
    Filed: April 7, 2009
    Date of Patent: June 26, 2012
    Assignee: Apple Inc.
    Inventors: Jeffry E. Gonion, Keith E. Diefendorff, Jr.
  • Publication number: 20120131308
    Abstract: A device, system, and method for processing program instructions, for example, to execute intra vector operations. A fetch unit may receive a program instruction defining different operations on data elements stored at the same vector memory address. A processor may include different types of execution units each executing a different one of a predetermined plurality of elemental instructions. Each program instruction may be a combination of one or more of the elemental instructions. The processor may receive a vector of data elements stored non-consecutively at the same vector memory address to be processed by a same one of the elemental instructions and a vector of configuration values independently associated with executing the same elemental instruction on the non-consecutive data elements. At least two configuration values may be different to implement different operations by executing the same elemental instruction using the different configuration values on the vector of non-consecutive data elements.
    Type: Application
    Filed: November 18, 2010
    Publication date: May 24, 2012
    Inventors: Yaakov Dekter, Michael Boukaya, Shai Shpigelblat, Moshe Steinberg
  • Publication number: 20120124332
    Abstract: A vector processing circuit includes a vector register file including a plurality of array elements, a command issuance control circuit, and a plurality of pipeline arithmetic units. Each pipeline arithmetic unit performs arithmetic processing of data stored in the array elements indicated as a source by one command in parts through a plurality of cycles and stores the result in the array elements indicated as a destination by the one command through a plurality of cycles. When data word length of a preceding command is longer than that of a subsequent command, the command issuance control circuit changes data sizes of the array elements in accordance with data word length of the command and determines whether there is register interference between the array element to be processed at a non-head cycle of the preceding command, and the array element to be processed at a head cycle of the subsequent command.
    Type: Application
    Filed: October 24, 2011
    Publication date: May 17, 2012
    Applicant: FUJITSU LIMITED
    Inventors: GE Yi, Yoshimasa Takebe, Hiromasa Takahashi
  • Patent number: 8169439
    Abstract: Embodiments of the invention are generally related to image processing, and more specifically to vector units for supporting image processing. A combined vector/scalar unit is provided wherein one or more processing lanes of the vector unit are used for performing scalar operations. An integrated register file is also provided for storing vector and scalar data. Therefore, the transfer of data to memory to exchange data between independent vector and scalar units is obviated and a significant amount of chip area is saved.
    Type: Grant
    Filed: October 23, 2007
    Date of Patent: May 1, 2012
    Assignee: International Business Machines Corporation
    Inventors: David Arnold Luick, Eric Oliver Mejdrich, Adam James Muff
  • Publication number: 20120102299
    Abstract: A processing system includes processors and dynamically configurable communication elements (DCCs) coupled together in an interspersed arrangement. A source device may transfer a data item through an intermediate subset of the DCCs to a destination device. The source and destination devices may each correspond to different processors, DCCs, or input/output devices, or mixed combinations of these. In response to detecting a stall after the source device begins transfer of the data item to the destination device and prior to receipt of all of the data item at the destination device, a stalling device is operable to propagate stalling information through one or more of the intermediate subset towards the source device. In response to receiving the stalling information, at least one of the intermediate subset is operable to buffer all or part of the data item.
    Type: Application
    Filed: December 30, 2011
    Publication date: April 26, 2012
    Inventors: Michael B. Doerr, William H. Hallidy, David A. Gibson, Craig M. Chase
  • Publication number: 20120086591
    Abstract: A signal converting system has a multi-segment digital to analog converter coupled to an error shaping loop. A control value is received at a vector processor that indicates a number N of elements that are to be selected from a vector having M elements. The elements of the vector are sorted into a bitonic sequence and separated into a larger value group and a smaller value group using a bitonic split. Only the larger value group is sorted into an ordered sequence with repeated bitonic splits when the control value is less than M/2, and N largest elements are selected from the ordered sequence. Only the smaller value group is sorted into an ordered sequence with repeated bitonic splits when the control value is greater than M/2, and N−M/2 largest elements are selected from the ordered sequence.
    Type: Application
    Filed: October 12, 2010
    Publication date: April 12, 2012
    Inventor: Yanto Suryono
  • Patent number: 8156310
    Abstract: One embodiment of the present method and apparatus for data stream alignment support includes retrieving a first input from a first register file, retrieving a second input from a second register file, the second register file being dedicated to a stream shift unit and performing the stream shift instruction in accordance with the first input, the second input and a third input.
    Type: Grant
    Filed: September 11, 2006
    Date of Patent: April 10, 2012
    Assignee: International Business Machines Corporation
    Inventors: Alexandre E. Eichenberger, Michael Karl Gschwind, John-David Wellman, Peng Wu
  • Publication number: 20120079233
    Abstract: A semiconductor processor is described. The semiconductor processor includes logic circuitry to perform a logical reduction instruction. The logic circuitry has swizzle circuitry to swizzle a vector's elements so as to form a swizzle vector. The logic circuitry also has vector logic circuitry to perform a vector logic operation on said vector and said swizzle vector.
    Type: Application
    Filed: September 24, 2010
    Publication date: March 29, 2012
    Inventors: Jeff Wiedemeier, Sridhan Samudrala, Roger Golliver
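    Illustration (not part of the patent record): one way a swizzle plus a vector logic operation can yield a logical reduction is sketched below. The abstract describes a single swizzle and a single logic operation; repeating the pair log2(n) times with a rotate-style swizzle, and using OR as the logic operation, are assumptions made for this example, as are the function names.

```python
def swizzle_rotate(vec, shift):
    """Hypothetical swizzle: rotate the elements left by `shift` positions."""
    return vec[shift:] + vec[:shift]


def reduce_or(vec):
    """OR-reduce a vector whose length is a power of two.

    Each step ORs the vector with a swizzled copy of itself, halving the
    rotation distance until every lane holds the OR of all elements.
    """
    shift = len(vec) // 2
    while shift >= 1:
        swizzled = swizzle_rotate(vec, shift)
        vec = [a | b for a, b in zip(vec, swizzled)]
        shift //= 2
    return vec[0]


print(reduce_or([0b0001, 0b0010, 0b0100, 0b1000]))  # 15
```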
  • Patent number: 8131981
    Abstract: A data processing system, apparatus and method for performing fractional multiply operations is disclosed. The system includes a memory that stores instructions for SIMD operations and a processing core. The processing core includes registers that store operands for the fractional multiply operations. A coprocessor included in the processing core performs the fractional multiply operations on the operands and stores the result in a destination register that is also included in the processing core.
    Type: Grant
    Filed: August 12, 2009
    Date of Patent: March 6, 2012
    Assignee: Marvell International Ltd.
    Inventors: Nigel C. Paver, Bradley C. Aldrich
  • Patent number: 8131979
    Abstract: The described embodiments provide a system that determines data dependencies between two vector memory operations or two memory operations that use vectors of memory addresses. During operation, the system receives a first input vector and a second input vector. The first input vector includes a number of elements containing memory addresses for a first memory operation, while the second input vector includes a number of elements containing memory addresses for a second memory operation, wherein the first memory operation occurs before the second memory operation in program order. The system then determines elements in the first and second input vectors where the memory addresses indicate that a dependency exists between the memory operations. The system next generates a result vector, wherein the result vector indicates the elements where dependencies exist between the memory operations.
    Type: Grant
    Filed: April 7, 2009
    Date of Patent: March 6, 2012
    Assignee: Apple Inc.
    Inventors: Jeffry E. Gonion, Keith E. Diefendorff, Jr.
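    Illustration (not part of the patent record): a simplified reading of the dependency check in the abstract above is sketched below. The rule used here is an assumption: element i of the second operation is flagged when its address matches an address used by an earlier-numbered element of the first operation. The abstract does not spell out the exact rule or the encoding of the result vector, and the function name is hypothetical.

```python
def dependency_vector(first_addrs, second_addrs):
    """Flag elements of the second memory operation that depend on the first.

    Assumed rule: element i of the second operation depends on the first
    operation if some element j < i of the first operation uses the same
    memory address.
    """
    result = []
    for i, addr in enumerate(second_addrs):
        earlier = first_addrs[:i]  # addresses touched by elements j < i
        result.append(addr in earlier)
    return result


first = [100, 104, 108, 112]   # addresses for the first memory operation
second = [200, 100, 204, 104]  # addresses for the second memory operation
print(dependency_vector(first, second))  # [False, True, False, True]
```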
  • Publication number: 20120023308
    Abstract: Provided is a parallel comparison/selection operation apparatus which efficiently executes a search for a maximum value or a search for a minimum value with an index. The parallel comparison/selection operation apparatus includes a vector comparison/selection unit 242 that compares each element included in vector data 1 and vector data 2 for each corresponding element using the vector data 1 and the vector data 2, selects one element of the vector data 1 and the vector data 2 based on the comparison result, and generates vector data 3 including the selected element, and an index vector selection unit 243 that selects one element of an index vector 1 and an index vector 2 based on the comparison result vector using the index vector 1 of the vector data 1, the index vector 2 of the vector data 2, and the comparison result vector to generate and output an index vector 3 including the selected element.
    Type: Application
    Filed: January 25, 2010
    Publication date: January 26, 2012
    Applicants: RENESAS ELECTRONICS CORPORATION, NEC CORPORATION
    Inventors: Takahiro Kumura, Hideki Matsuyama
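    Illustration (not part of the patent record): the paired data and index selection in the abstract above can be sketched as follows. This is a minimal sketch for the maximum-value case; the function name and the tie-breaking choice (prefer vector 1 on equality) are assumptions.

```python
def compare_select_max(data1, data2, index1, index2):
    """Select the larger element per lane and carry its index along."""
    # Comparison result vector: True where vector 1 wins (ties go to vector 1).
    cmp_result = [a >= b for a, b in zip(data1, data2)]
    # Vector data 3: the selected data element for each lane.
    data3 = [a if c else b for a, b, c in zip(data1, data2, cmp_result)]
    # Index vector 3: the index that belongs to the selected element.
    index3 = [i if c else j for i, j, c in zip(index1, index2, cmp_result)]
    return data3, index3


data3, index3 = compare_select_max([3, 9, 2, 5], [7, 1, 8, 5],
                                   [0, 1, 2, 3], [4, 5, 6, 7])
print(data3)   # [7, 9, 8, 5]
print(index3)  # [4, 1, 6, 3]
```

    Reusing one comparison result for both selections is what lets the index track the maximum without a second pass over the data.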
  • Patent number: 8103852
    Abstract: An information handling system includes a processor with a bifurcated unified issue queue that may perform unified issue queue VSU store instruction dependency operations. The bifurcated unified issue queue BUIQ maintains VSU store instructions in the form of internal operations data. The BUIQ includes a unified issue queue UIQ 0 and a unified issue queue UIQ 1. The BUIQ may manage a particular VSU store instruction from one UIQ to determine data dependencies and employ the other UIQ to determine address dependencies of that particular VSU store instruction. The UIQs employ a dependency matrix including a dependency array. The dependency array data maintains both data and address dependency information. The particular VSU store instruction issues to execution units such as VSUs for data dependency information and load store units (LSUs) for address dependency information. A particular VSU store instruction may execute to provide data dependency information independent of address dependency information.
    Type: Grant
    Filed: December 22, 2008
    Date of Patent: January 24, 2012
    Assignee: International Business Machines Corporation
    Inventors: James Wilson Bishop, Mary Douglass Brown, William Elton Burky, Todd Alan Venton
  • Patent number: 8094768
    Abstract: The present invention discloses a novel multi-channel timing recovery scheme that utilizes a shared CORDIC to accurately compute the phase for each tone. Then a hardware-based linear combiner module is used to reconstruct the best phase estimate from multiple phase measurements. The firmware monitors the noise variance for the pilot tones and determines the corresponding weight for each tone to ensure that the minimum phase jitter noise is achieved through the linear combiner. Then a hardware-based second-order timing recovery control loop generates the frequency reference signal for VCXO or DCXO. A single sequentially controlled multiplier is used for all multiplications in the control loop.
    Type: Grant
    Filed: December 21, 2006
    Date of Patent: January 10, 2012
    Assignee: Triductor Technology (Suzhou) Inc.
    Inventor: Yaolong Tan
  • Publication number: 20110320765
    Abstract: A computer processor, method, and computer program product for executing vector processing instructions on a variable width vector register file. An example embodiment is a computer processor that includes an instruction execution unit coupled to a variable width vector register file which contains a number of vector registers, the width of the vector registers is changeable during operation of the computer processor.
    Type: Application
    Filed: June 28, 2010
    Publication date: December 29, 2011
    Applicant: International Business Machines Corporation
    Inventors: Tejas Karkhanis, Jose E. Moreira, Valentina Salapura
  • Publication number: 20110314254
    Abstract: The present application relates to a method for processing data in a vector processor. The present application relates also to a vector processor for performing said method and a cellular communication device comprising said vector processor. The method for processing data in a vector processor comprises executing segmented operations on a segment of a vector for generating results, collecting the results of the segmented operations, and delivering the results in a result vector in such a way that subsequent operations remain processing in vector mode.
    Type: Application
    Filed: May 29, 2009
    Publication date: December 22, 2011
    Applicant: NXP B.V.
    Inventors: Mahima Smriti, Jean-Paul Charles Francois Hubert Smeets, Willem Egbert Hendrik Kloosterhuis
  • Patent number: 8074051
    Abstract: A multithreaded processor comprises a plurality of hardware thread units, an instruction decoder coupled to the thread units for decoding instructions received therefrom, and a plurality of execution units for executing the decoded instructions. The multithreaded processor is configured for controlling an instruction issuance sequence for threads associated with respective ones of the hardware thread units. On a given processor clock cycle, only a designated one of the threads is permitted to issue one or more instructions, but the designated thread that is permitted to issue instructions varies over a plurality of clock cycles in accordance with the instruction issuance sequence. The instructions are pipelined in a manner which permits at least a given one of the threads to support multiple concurrent instruction pipelines.
    Type: Grant
    Filed: April 1, 2005
    Date of Patent: December 6, 2011
    Assignee: Aspen Acquisition Corporation
    Inventors: Erdem Hokenek, Mayan Moudgill, Michael J. Schulte, C. John Glossner
  • Patent number: 8065502
    Abstract: A macroscalar processor architecture is described herein. In one embodiment, a processor receives instructions of a program loop having a vector block and a sequence block intended to be executed after the vector block, where the processor includes multiple slices and each of the slices is capable of executing an instruction of an iteration of the program loop substantially in parallel. For each iteration of the program loop, the processor executes an instruction of the sequence block using one of the slices while executing instructions of the vector block using a remainder of the slices substantially in parallel. Other methods and apparatuses are also described.
    Type: Grant
    Filed: November 6, 2009
    Date of Patent: November 22, 2011
    Assignee: Apple Inc.
    Inventor: Jeffry E. Gonion
  • Publication number: 20110219207
    Abstract: A reconfigurable processor for efficiently performing a vector operation, and a method of controlling the reconfigurable processor are provided. The reconfigurable processor designates at least one of a plurality of processing elements as a vector lane based on vector lane configuration information, and allocates a vector operation to the designated vector lane.
    Type: Application
    Filed: January 10, 2011
    Publication date: September 8, 2011
    Applicant: Samsung Electronics Co., Ltd.
    Inventors: Dong-Kwan Suh, Hyeong-Seok Yu, Suk-Jin Kim
  • Patent number: 7971197
    Abstract: A digital computer system automatically creates an Instruction Set Architecture (ISA) that potentially exploits VLIW instructions, vector operations, fused operations, and specialized operations with the goal of increasing the performance of a set of applications while keeping hardware cost below a designer specified limit, or with the goal of minimizing hardware cost given a required level of performance.
    Type: Grant
    Filed: August 18, 2005
    Date of Patent: June 28, 2011
    Assignee: Tensilica, Inc.
    Inventors: David William Goodwin, Dror Maydan, Ding-Kai Chen, Darin Stamenov Petkov, Steven Weng-Kiang Tjiang, Peng Tu, Christopher Rowen
  • Publication number: 20110153980
    Abstract: To provide a device to reconfigure multi-level logic networks, which enable logic modification and reconfiguration of a multi-level logic network with small circuit area and low-power dissipation in a simple manner. For example, in the case of reconfiguring a multi-level logic network following logic modification for deleting an output vector F(b) of an objective logic function F(X) corresponding to an input vector b, unmodified pq elements are selected one by one from the nearest pq element EG to an output side. At this time, among output values of pq elements closer to an input side than selected pq elements, output values corresponding to the input vector, which equal an output value corresponding to any input variable X other than the input vector b are considered modified and thus not selected. Then, a selected output value corresponding to the input vector b is rewritten to an “invalid value”.
    Type: Application
    Filed: March 2, 2007
    Publication date: June 23, 2011
    Applicant: KYUSHU INSTITUTE OF TECHNOLOGY
    Inventors: Tsutomu Sasao, Kazuto Ishida
  • Publication number: 20110113217
    Abstract: The described embodiments include a processor that executes a vector instruction. The processor starts by receiving a first input vector, a second input vector, and optionally receiving a predicate vector (each of which includes N elements) as inputs. The processor then executes the vector instruction. Executing the vector instruction causes the processor to generate a result vector. When generating the result vector, if the predicate vector was received, for each element of the result vector for which the corresponding element of the predicate vector is active, otherwise, for each element of the result vector, the processor determines elements that are to be set in the result vector based on values in elements in the first input vector and the second input vector. The processor then sets the determined elements of the result vector to a first predetermined value.
    Type: Application
    Filed: January 13, 2011
    Publication date: May 12, 2011
    Applicant: APPLE INC.
    Inventors: Jeffry E. Gonion, Keith E. Diefendorff
  • Patent number: 7937359
    Abstract: A method of operating a Linear Complementarity Problem (LCP) solver is disclosed, where the LCP solver is characterized by multiple execution units operating in parallel to implement a competent computational method adapted to resolve physics-based LCPs in real-time.
    Type: Grant
    Filed: April 27, 2009
    Date of Patent: May 3, 2011
    Assignee: NVIDIA Corporation
    Inventors: Lihua Zhang, Richard Tonge, Dilip Sequeira, Monier Maher
  • Publication number: 20110055517
    Abstract: A structure (and method) including a plurality of coprocessing units and a controller that selectively loads data for processing on the plurality of coprocessing units, using a compound loading instruction. The compound loading instruction includes a plurality of low-level software instructions that preliminarily processes input data in a manner predetermined to simulate an effect of a single hardware loading instruction that would provide optimal loading of complex matrix data by loading input data in accordance with the effect of multiplying i·i=−1.
    Type: Application
    Filed: August 26, 2009
    Publication date: March 3, 2011
    Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
    Inventors: Alexandre E. Eichenberger, Michael Karl Gschwind, John A. Gunnels, Fred Gehrung Gustavson, Brett Olsson
  • Patent number: 7895379
    Abstract: Control logic of a node controller receives an input vector and produces an output vector. The control logic includes a plurality of tied control store entries including hard-coded logic to identify unique values of the input vector and to produce the output vector from a hard-coded output vector when the input vector is identified and when the tied control store is enabled. The control logic also includes a plurality of spare control store entries including programmable logic configurable to identify values of the input vector and to produce the output vector from a programmable output vector when the input vector is identified and when the spare control store is enabled. One of the spare control store entries that is configured to identify a value of the input vector that none of the tied control store entries that are enabled by the entry-enables register are configured to identify is enabled.
    Type: Grant
    Filed: December 23, 2008
    Date of Patent: February 22, 2011
    Assignee: Unisys Corporation
    Inventors: Ross M. Weber, David R. Spatafore
  • Publication number: 20110035568
    Abstract: The described embodiments include a processor that executes a vector instruction. The processor starts by receiving a vector instruction that uses a first input vector, a second input vector, and a control vector, and optionally a predicate vector as inputs, wherein each of the vectors includes N elements. The processor then executes the vector instruction. In the described embodiments, when executing the vector instruction, the processor determines a key element position. If the predicate vector is received, the key element position is a predetermined active element position in the predicate vector, otherwise, the key element position is in a predetermined element position. The processor then uses the key element position to copy a result value into a result variable.
    Type: Application
    Filed: October 19, 2010
    Publication date: February 10, 2011
    Applicant: APPLE INC.
    Inventors: Jeffry E. Gonion, Keith E. Diefendorff
  • Patent number: 7877573
    Abstract: One embodiment of the present invention sets forth a technique for computing a parallel prefix sum using one or more cooperative thread arrays (CTA) within a graphics processing unit. The prefix sum input list is partitioned and distributed to each CTA. Within each CTA, the input list is further partitioned for processing by individual threads in a way that avoids access conflicts to memory. Each list partition within the CTA is assigned to one of a plurality of concurrent threads, which executes a prefix sum operation on the partition. The final values of the prefix sum operations form a list that is then subjected to a second prefix sum operation. Each element of the second prefix sum operation is added to each element of the subsequent partition, completing the prefix sum operation within the CTA. This technique may be extended to prefix sum operations that span two or more CTAs.
    Type: Grant
    Filed: August 8, 2007
    Date of Patent: January 25, 2011
    Assignee: NVIDIA Corporation
    Inventor: Scott M. Le Grand
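    The three steps in this abstract (per-partition prefix sums, a second prefix sum over the partition totals, then adding each total to the following partition) can be sketched in sequential Python. This is a minimal illustration, not GPU code and not the patented implementation: the function name `partitioned_prefix_sum` and the `num_parts` parameter are assumptions, the partitions are processed one after another rather than by concurrent threads, and the sketch computes an inclusive scan.

```python
from itertools import accumulate

def partitioned_prefix_sum(values, num_parts):
    """Inclusive prefix sum built from per-partition scans (sequential stand-in for threads)."""
    if num_parts < 1:
        raise ValueError("num_parts must be at least 1")
    if not values:
        return []
    size = -(-len(values) // num_parts)  # ceiling division
    parts = [values[i:i + size] for i in range(0, len(values), size)]
    # Step 1: each partition computes its own local prefix sum.
    local = [list(accumulate(p)) for p in parts]
    # Step 2: a second prefix sum over the final value of each partition.
    totals = list(accumulate(p[-1] for p in local))
    # Step 3: add the running total of the preceding partitions to each later partition.
    out = list(local[0])
    for k in range(1, len(local)):
        out.extend(x + totals[k - 1] for x in local[k])
    return out
```

    Splitting `[1, 2, 3, 4, 5, 6, 7]` into three partitions gives local scans `[1, 3, 6]`, `[4, 9, 15]`, and `[7]`; the second scan over the final values `6, 15, 7` yields the offsets applied to the later partitions.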
  • Patent number: 7873812
    Abstract: The new system provides for efficient implementation of matrix multiplication in a SIMD processor. The new system provides the ability to map any element of a source vector register to be paired with any element of a second source vector register for vector operations, specifically vector-multiply and vector-multiply-accumulate operations, to implement a variety of matrix multiplications without additional permute or data re-ordering instructions. Operations such as DCT and color-space transformations for video processing can be implemented very efficiently using this system.
    Type: Grant
    Filed: April 5, 2004
    Date of Patent: January 18, 2011
    Inventor: Tibet Mimar
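    A rough sketch of how an element-mapped multiply-accumulate can build a matrix-vector product without separate permute steps. The names `mapped_mac` and `matvec`, the column-major layout, and the per-lane `mapping` list are assumptions made for illustration; the abstract does not describe the instruction encoding.

```python
def mapped_mac(acc, src1, src2, mapping):
    """Per lane i: acc[i] + src1[i] * src2[mapping[i]] (hypothetical mapped multiply-accumulate)."""
    return [a + x * src2[m] for a, x, m in zip(acc, src1, mapping)]

def matvec(columns, v):
    """Multiply a matrix, given as a list of columns, by vector v."""
    acc = [0] * len(columns[0])
    for j, col in enumerate(columns):
        # Every lane pairs its column element with element j of v,
        # so no separate broadcast or permute step is needed.
        acc = mapped_mac(acc, col, v, [j] * len(col))
    return acc
```

    With the matrix [[1, 2], [3, 4]] stored as columns `[[1, 3], [2, 4]]` and `v = [5, 6]`, the first step accumulates `[5, 15]` and the second adds `[12, 24]`.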
  • Patent number: 7865898
    Abstract: A system that reduces execution time of a parallel SVM application. During operation, the system partitions an input data set into chunks of data. Next, the system distributes the partitioned chunks of data across a plurality of available computing nodes and executes the parallel SVM application on the chunks of data in parallel across the plurality of available computing nodes. The system then determines if a first timeout period has been exceeded before all of the plurality of available computing nodes have finished processing their respective chunks of data. If so, the system (1) repartitions the input data set into different chunks of data; (2) redistributes the repartitioned chunks of data across some or all of the plurality of available computing nodes; and (3) executes the parallel SVM application on the repartitioned chunks of data in parallel across some or all of the available computing nodes.
    Type: Grant
    Filed: January 27, 2006
    Date of Patent: January 4, 2011
    Assignee: Oracle America, Inc.
    Inventors: Kalyanaraman Vaidyanathan, Kenny C. Gross
  • Publication number: 20100318764
    Abstract: A system and method of compiling program code, wherein the program code includes an operation on an array of data elements stored in memory of a computer system. The program code is scanned for operations that are vectorizable. The vectorizable operations are examined to determine whether they should be executed at least in part in a vector atomic memory operation (AMO) functional unit attached to memory. If so, the compiled code includes vector AMO instructions.
    Type: Application
    Filed: June 12, 2009
    Publication date: December 16, 2010
    Applicant: Cray Inc.
    Inventor: Terry D. Greyzck
  • Publication number: 20100312988
    Abstract: A data processing apparatus and method are provided for handling vector instructions. The data processing apparatus has a register data store with a plurality of registers arranged to store data elements. A vector processing unit is then used to execute a sequence of vector instructions, with the vector processing unit having a plurality of lanes of parallel processing and having access to the register data store in order to read data elements from, and write data elements to, the register data store during the execution of the sequence of vector instructions. A skip indication storage maintains a skip indicator for each of the lanes of parallel processing. The vector processing unit is responsive to a vector skip instruction to perform an update operation to set within the skip indication storage the skip indicator for a determined one or more lanes.
    Type: Application
    Filed: January 19, 2010
    Publication date: December 9, 2010
    Applicant: ARM LIMITED
    Inventors: Andreas Björklund, Erik Persson, Ola Hugosson
  • Publication number: 20100313060
    Abstract: A data processing apparatus and method are provided for performing a predetermined rearrangement operation. The data processing apparatus comprises a vector register bank having a plurality of vector registers, with each vector register comprising a plurality of storage cells such that the plurality of vector registers provide a matrix of storage cells. Each storage cell is arranged to store a data element. A vector processing unit is provided for executing a sequence of vector instructions in order to apply operations to the data elements held in the vector register bank. Responsive to a vector matrix rearrangement instruction specifying a predetermined rearrangement operation to be performed on the data elements in the matrix of storage cells, the vector processing unit is arranged to issue a set rearrangement enable signal to the vector register bank.
    Type: Application
    Filed: January 19, 2010
    Publication date: December 9, 2010
    Applicant: ARM LIMITED
    Inventors: Andreas Björklund, Erik Persson, Ola Hugosson
  • Patent number: 7844959
    Abstract: A general purpose high-performance distributed execution engine for coarse-grained data-parallel applications is proposed that allows developers to easily create large-scale distributed applications without requiring them to master concurrency techniques beyond being able to draw a graph of the data-dependencies of their algorithms. Based on the graph, a job manager intelligently distributes the work load so that the resources of the execution engine are used efficiently. During runtime, the job manager (or other entity) can automatically modify the graph to improve efficiency. The modifications are based on runtime information, topology of the distributed execution engine, and/or the distributed application represented by the graph.
    Type: Grant
    Filed: September 29, 2006
    Date of Patent: November 30, 2010
    Assignee: Microsoft Corporation
    Inventor: Michael A. Isard
  • Patent number: 7809931
    Abstract: A vector permutation system (100) for a single-instruction multiple-data microprocessor has a set of vector registers (110) which feed vectors to permutation logic (120) and then to a negate block (130) where they are permuted and selectively negated according to control parameters received from a selected one of a set of control registers (140). A control arrangement (145, 150) selects which control register is to provide the control parameters. In this way no separate permutation instructions are necessary or need to be executed, and no permutation parameters need to be stored in the vector registers (110). This leads to higher performance, a smaller vector register file and hence a smaller size of the microprocessor and better program code density.
    Type: Grant
    Filed: October 6, 2003
    Date of Patent: October 5, 2010
    Assignee: Freescale Semiconductor, Inc.
    Inventor: Martin Raubuch
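    The permute-then-negate path in this abstract can be modelled in a few lines. This is a simplified sketch under stated assumptions: the control parameters are represented here as a list of `(source_index, negate)` pairs, which is a hypothetical encoding and not the register format of the patent.

```python
def permute_negate(vec, control):
    """Apply a permutation and selective negation to vec.

    control holds one (source_index, negate) pair per output element;
    this encoding is an assumption made for the sketch.
    """
    return [-vec[src] if neg else vec[src] for src, neg in control]
```

    One use of such a pattern is complex multiplication: with `[c, d]` holding the real and imaginary parts of an operand, the control `[(1, True), (0, False)]` produces `[-d, c]`, the operand multiplied by i.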
  • Patent number: 7809925
    Abstract: A vectorizable execution unit is capable of being operated in a plurality of modes, with the processing lanes in the vectorizable execution unit grouped into different combinations of logical execution units in different modes. By doing so, processing lanes can be selectively grouped together to operate as different types of vector execution units and/or scalar execution units, and if desired, dynamically switched during runtime to process various types of instruction streams in a manner that is best suited for each type of instruction stream. As a consequence, a single vectorizable execution unit may be configurable, e.g., via software control, to operate either as a vector execution unit or a plurality of scalar execution units.
    Type: Grant
    Filed: December 7, 2007
    Date of Patent: October 5, 2010
    Assignee: International Business Machines Corporation
    Inventors: Eric Oliver Mejdrich, Adam James Muff, Matthew Ray Tubbs
  • Patent number: 7779229
    Abstract: A processor arrangement having a strip structure for parallel data processing is configured so that local data from the individual processing units or strips is brought together in a rapid manner. Input data, intermediate data and/or output data from various processing units are linked together in an operation which is at least partially combinatory. The data linking operation is not clock controlled. The linking of the local data from various strips in this manner reduces delays in parallel data processing in the processor arrangement. The combinatory data linking operation can provide an overall data linking outcome within an individual clock cycle.
    Type: Grant
    Filed: February 12, 2003
    Date of Patent: August 17, 2010
    Assignee: NXP B.V.
    Inventor: Wolfram Drescher
  • Patent number: 7739479
    Abstract: A method of providing physics data within a game program or simulation using a hardware-based physics processing unit having unique architecture designed to efficiently calculate physics related data.
    Type: Grant
    Filed: November 19, 2003
    Date of Patent: June 15, 2010
    Assignee: NVIDIA Corporation
    Inventors: Jean Pierre Bordes, Curtis Davis, Monier Maher, Manju Hegde, Otto A. Schmid
  • Patent number: 7735090
    Abstract: A method, apparatus and article of manufacture to dynamically modify, terminate, or replace software components and connections (i.e., contracts) between components in a running assembly. Information about the component and contracts between components in a running assembly is used to determine an allowable sequence of management commands to transition the assembly of components from a current state to a specified goal state. At the same time, other components may continue to perform an operational workflow.
    Type: Grant
    Filed: December 15, 2005
    Date of Patent: June 8, 2010
    Assignee: International Business Machines Corporation
    Inventors: James E. Carey, Scott N. Gerard