Single Instruction, Multiple Data (simd) Patents (Class 712/22)
-
Publication number: 20140189296Abstract: A loop remainder mask instruction indicates a current iteration count of a loop as a first operand, an iteration limit of a loop as a second operand, and a destination. The loop contains iterations and each iteration includes a data element of the array. A processor receives the loop remainder mask instruction, decodes the instruction for execution, and stores a result of the execution in the destination. The result indicates a number of data elements of the array past an end of a preceding portion of the array that are to be handled separately from the preceding portion, the end of the preceding portion being where the current iteration count is recorded.Type: ApplicationFiled: December 14, 2011Publication date: July 3, 2014Inventors: Elmoustapha Ould-Ahmed-Vall, Robert Valentine, Jesus Corbal, Andrey Naraikin, Suleyman Sair, Asaf Hargil, Miland B. Girkar, Bret T. Toll, Mark J. Charney
-
Patent number: 8769244Abstract: Uniforming of the processing load is efficiently realized. Each processing element configuring an SIMD parallel computer system includes a data storage module that stores data processed or transferred, a number-of-data-sets storage device that stores number of data sets, and a front data storage device that stores the front data. Each processing element further includes a control processor that compares the number of data sets stored in one processing element with the number of data sets stored in the own processing element, and issues a data distribution leveling instruction that designates an action for updating contents of the data storage module, the number-of-data-sets storage device, and the front data storage device according to a rule determined based on a comparison result of the own processing element and that of the other processing elements and an action for moving the data stored in the one processing element to the own processing element.Type: GrantFiled: April 8, 2009Date of Patent: July 1, 2014Assignee: Nec CorporationInventor: Shorin Kyo
-
Publication number: 20140181467Abstract: Methods, and media, and computer systems are provided. The method includes, the media includes control logic for, and the computer system includes a processor with control logic for overriding an execution mask of SIMD hardware to enable at least one of a plurality of lanes of the SIMD hardware. Overriding the execution mask is responsive to a data parallel computation and a diverged control flow of a workgroup.Type: ApplicationFiled: December 21, 2012Publication date: June 26, 2014Applicant: ADVANCED MICRO DEVICES, INC.Inventors: Timothy G. Rogers, Bradford M. Beckmann, James M. O'Connor
-
Patent number: 8762691Abstract: A data processing apparatus includes a plurality of processing elements arranged in a single instruction multiple data array. The apparatus includes an instruction controller operable to receive instructions from a plurality of instructions streams, and to transfer instructions from those instructions streams to the processing elements in the array, such that the data processing apparatus is operable to process a plurality of processing threads substantially in parallel with one another. A data transfer controller is provided which is operable to control transfer of data between the internal memory units associated with the processing elements, and memory external to the array.Type: GrantFiled: June 29, 2007Date of Patent: June 24, 2014Assignee: Rambus Inc.Inventors: Dave Stuttard, Dave Williams, Eamon O'Dea, Gordon Faulds, John Rhoades, Ken Cameron, Phil Atkin, Paul Winser, Russell David, Ray McConnell, Tim Day, Trey Greer
-
Patent number: 8756270Abstract: A mechanism is provided in a collective acceleration unit for performing a collective operation to distribute or collect data among a plurality of participant nodes. The mechanism receives an input collective packet for a collective operation from a neighbor node within a collective tree. The input collective packet comprises a tree identifier and an input data field and wherein the collective tree comprises a plurality of sub trees. The mechanism maps the tree identifier to an index within the collective acceleration unit. The index identifies a portion of resources within the collective acceleration unit and is associated with a set of neighbor nodes in a given sub tree within the collective tree. For each neighbor node the collective acceleration unit stores destination information. The collective acceleration unit performs an operation on the input data field using the portion of resources to effect the collective operation.Type: GrantFiled: April 24, 2012Date of Patent: June 17, 2014Assignee: International Business Machines CorporationInventors: Lakshminarayana B. Arimilli, Bernard C. Drerup, Paul F. Lecocq, Hanhong Xue
-
Patent number: 8751655Abstract: A mechanism is provided in a collective acceleration unit for performing a collective operation to distribute or collect data among a plurality of participant nodes. The mechanism receives an input collective packet for a collective operation from a neighbor node within a collective tree. The input collective packet comprises a tree identifier and an input data field and wherein the collective tree comprises a plurality of sub trees. The mechanism maps the tree identifier to an index within the collective acceleration unit. The index identifies a portion of resources within the collective acceleration unit and is associated with a set of neighbor nodes in a given sub tree within the collective tree. For each neighbor node the collective acceleration unit stores destination information. The collective acceleration unit performs an operation on the input data field using the portion of resources to effect the collective operation.Type: GrantFiled: March 29, 2010Date of Patent: June 10, 2014Assignee: International Business Machines CorporationInventors: Lakshminarayana B. Arimilli, Bernard C. Drerup, Paul F. Lecocq, Hanhong Xue
-
Patent number: 8745358Abstract: Method, apparatus, and program means for performing bitstream buffer manipulation with a SIMD merge instruction. The method of one embodiment comprises determining whether any unprocessed data bits for a partial variable length symbol exist in a first data block is made. A shift merge operation is performed to merge the unprocessed data bits from the first data block with a second data block. A merged data block is formed. A merged variable length symbol comprised of the unprocessed data bits and a plurality of data bits from the second data block is extracted from the merged data block.Type: GrantFiled: September 4, 2012Date of Patent: June 3, 2014Assignee: Intel CorporationInventors: Julien Sebot, William W. Macy, Eric Debes, Huy V. Nguyen
-
Patent number: 8732437Abstract: Systems and methods for performing single instruction multiple data (SIMD) operations on a data set. The methods may include examining a structure of the data set to determine what reorganization may be necessary to facilitate SIMD processing. The method may include selecting a stored bit mask corresponding to the organization of the data set and loading the bit mask into an application specific register (ASR). Subsequently, the data may be reorganized inline according to the ASR as the data is loaded into the SIMD functional unit such that the SIMD functional unit may operate on the data set. The results of the SIMD operation may be written to a results register.Type: GrantFiled: January 26, 2010Date of Patent: May 20, 2014Assignee: Oracle America, Inc.Inventor: Lawrence A. Spracklen
-
Patent number: 8725990Abstract: A configurable SIMD engine in a video processor for executing video processing operations. The engine includes a SIMD component having a plurality of inputs for receiving input data and a plurality of outputs for providing output data. A plurality of execution units are included in the SIMD component. Each of the execution units comprise a first and a second data path, and are configured for selectively implementing arithmetic operations on a set of low precision or high precision inputs. Each of the execution units have a first configuration and a second configuration, such that the first data path and the second data path are combined to produce a single high precision output in the first configuration, and such that the first data path and the second data path are partitioned to produce a respective first low precision output and second low precision output in the second configuration.Type: GrantFiled: November 4, 2005Date of Patent: May 13, 2014Assignee: Nvidia CorporationInventors: Ashish Karandikar, Pooja Agarwal
-
Patent number: 8726252Abstract: A compiler of a single instruction multiple data (SIMD) information handling system (IHS) identifies “if-then-else” statements that offer opportunity for conditional branch conversion. The SIMD IHS employs a processor or processors to execute the executable program. During execution, the processor generates and updates SIMD lane mask information to track and manage the conditional branch loops of the executing program. The processor saves branch addresses and employs SIMD lane masks to identify conditional branch loops with different branch conditions than previous conditional branch loops. The processor may reduce SIMD IHS processing time during processing of compiled code of the original “if-then-else” statements. The processor continues processing next statements inline after all SIMD lanes are complete, while providing speculative and parallel processing capability for multiple data operations of the executable program.Type: GrantFiled: January 28, 2011Date of Patent: May 13, 2014Assignee: International Business Machines CorporationInventors: Alexandre E. Eichenberger, Brian Flachs, Dorit Nuzman, Ira Rosen, Ulrich Weigand, Ayal Zaks
-
Register files for a digital signal processor operating in an interleaved multi-threaded environment
Patent number: 8713286Abstract: A processor device is disclosed and includes a memory and a sequencer that is responsive to the memory. The sequencer supports very long instruction word (VLIW) type instructions and at least one VLIW instruction packet uses a number of operands during execution. The processor device further includes a plurality of instruction execution units responsive to the sequencer and a plurality of register files. Each of the plurality of register files includes a plurality of registers and the plurality of register files are coupled to the plurality of instruction execution units. Further, each of the plurality of register files includes a number of data read ports and the number of data read ports of each of the plurality of register files is less than the number of operands used by the at least one VLIW instruction packet.Type: GrantFiled: April 26, 2005Date of Patent: April 29, 2014Assignee: QUALCOMM IncorporatedInventors: Muhammad Ahmed, Erich Plondke, Lucian Codrescu, William C. Anderson -
Patent number: 8700884Abstract: A processor in a data processing system executes a permutation instruction which identifies a first source register, at least one other source register, and a destination register. The first source register stores at least one in-range index value for the at least one other source register and at least one out-of-range index value for the at least one other source register. The at least one other source register stores a plurality of vector element values, wherein each in-range index value indicates which vector element value of the at least one other source register is to be stored into a corresponding vector element of the destination register. Each out-of-range index value is used to indicate which one of at least two predetermined constant values is to be stored into a corresponding vector element of the destination register. Partial table lookups using a permutation instruction shortens the time required to retrieve data.Type: GrantFiled: October 12, 2007Date of Patent: April 15, 2014Assignee: Freescale Semiconductor, Inc.Inventors: William C. Moyer, Imran Ahmed, Dan E. Tamir
-
Patent number: 8688959Abstract: Method, apparatus, and program means for shuffling data. The method of one embodiment comprises receiving a first operand having a set of L data elements and a second operand having a set of L control elements. For each control element, data from a first operand data element designated by the individual control element is shuffled to an associated resultant data element position if its flush to zero field is not set and a zero is placed into the associated resultant data element position if its flush to zero field is not set.Type: GrantFiled: September 10, 2012Date of Patent: April 1, 2014Assignee: Intel CorporationInventors: William W. Macy, Jr., Eric L. Debes, Patrice L. Roussel, Huy V. Nguyen
-
Patent number: 8688957Abstract: A system and method are configured to detect conflicts when converting scalar processes to parallel processes (“SIMDifying”). Conflicts may be detected for an unordered single index, an ordered single index and/or ordered pairs of indices. Conflicts may be further detected for read-after-write dependencies. Conflict detection is configured to identify operations (i.e., iterations) in a sequence of iterations that may not be done in parallel.Type: GrantFiled: December 21, 2010Date of Patent: April 1, 2014Assignee: Intel CorporationInventors: Mikhail Smelyanskiy, Yen-Kuang Chen, Daehyun Kim, Christopher J. Hughes, Victor W. Lee
-
Patent number: 8688958Abstract: A processor has a plurality of PEs (processing elements) that operate in parallel based on operation commands and an information collection unit that collects the data of the plurality of PEs, wherein each of the plurality of PEs holds data and a condition flag, supplies the data and the condition flag to the information collection unit upon receiving an operation command, and upon receiving an update request for updating the condition flag, updates the condition flag in accordance with the update request that was received; and the information collection unit, upon receiving the data and the condition flags, selects one PE based on a predetermined order of priority from among the PEs for which the received condition flags are active and both supplies the data of the selected PE as collection result data and supplies an update request for updating the condition flag of the PE that was selected.Type: GrantFiled: January 14, 2010Date of Patent: April 1, 2014Assignee: NEC CorporationInventor: Shohei Nomoto
-
Patent number: 8661225Abstract: A data processing apparatus and method and provided for handling vector instructions. The data processing apparatus has a register data store with a plurality of registers arranged to store data elements. A vector processing unit is then used to execute a sequence of vector instructions, with the vector processing unit having a plurality of lanes of parallel processing and having access to the register data store in order to read data elements from, and write data elements to, the register data store during the execution of the sequence of vector instructions. A skip indication storage maintains a skip indicator for each of the lanes of parallel processing. The vector processing unit is responsive to a vector skip instruction to perform an update operation to set within the skip indication storage the skip indicator for a determined one or more lanes.Type: GrantFiled: January 19, 2010Date of Patent: February 25, 2014Assignee: ARM LimitedInventors: Andreas Björklund, Erik Persson, Ola Hugosson
-
Patent number: 8656376Abstract: A method for providing intrinsic supports for a VLIW DSP processor with distributed register files comprises the steps of: generating a program representation with cluster information on instructions of the DSP processor, wherein the cluster information is provided by a program with cluster intrinsic coding; identifying data stream operations indicating parallel instruction sequences applied on different data sets in the program representation; identifying data sharing relations indicating data shared by the data stream operations in the program representation; identifying data aggregation relations indicating results aggregated from the data stream operations in the program representation; and performing register allocation for the DSP processor according to the identified data stream operations, the data sharing relations and the data aggregation relations.Type: GrantFiled: September 1, 2011Date of Patent: February 18, 2014Assignee: National Tsing Hua UniversityInventors: Jenq Kuen Lee, Chi Bang Kuan
-
Publication number: 20140032879Abstract: Search circuitry responsive to a single instruction for undertaking a step of a search of a data array for an extreme value therein, a method of searching a data array to identify an extreme value therein and a location thereof and a single-instruction, multiple-data (SIMD) processing unit incorporating the search circuitry or the method. In one embodiment, the search circuitry includes: a comparison element configured to compare two values in the data array, (2) multiplexers coupled to the comparison element and configured to select a more extreme value of the two values and a location in the data array of the more extreme value and (3) an incrementer configured to increment a counter associated with the search.Type: ApplicationFiled: July 26, 2012Publication date: January 30, 2014Applicant: VeriSilicon Holdings Co., LtdInventor: Stephen E. Jarboe
-
Patent number: 8638805Abstract: Described embodiments provide for restructuring a scheduling hierarchy of a network processor having a plurality of processing modules and a shared memory. The scheduling hierarchy schedules packets for transmission. The network processor generates tasks corresponding to each received packet associated with a data flow. A traffic manager receives tasks provided by one of the processing modules and determines a queue of the scheduling hierarchy corresponding to the task. The queue has a parent scheduler at each of one or more next levels of the scheduling hierarchy up to a root scheduler, forming a branch of the hierarchy. The traffic manager determines if the queue and one or more of the parent schedulers of the branch should be restructured. If so, the traffic manager drops subsequently received tasks for the branch, drains all tasks of the branch, and removes the corresponding nodes of the branch from the scheduling hierarchy.Type: GrantFiled: September 30, 2011Date of Patent: January 28, 2014Assignee: LSI CorporationInventors: Balakrishnan Sundararaman, Shashank Nemawarkar, David Sonnier, Shailendra Aulakh, Allen Vestal
-
Patent number: 8639914Abstract: An apparatus includes an instruction decoder, first and second source registers and a circuit coupled to the decoder to receive packed data from the source registers and to unpack the packed data responsive to an unpack instruction received by the decoder. A first packed data element and a third packed data element are received from the first source register. A second packed data element and a fourth packed data element are received from the second source register. The circuit copies the packed data elements into a destination register resulting with the second packed data element adjacent to the first packed data element, the third packed data element adjacent to the second packed data element, and the fourth packed data element adjacent to the third packed data element.Type: GrantFiled: December 29, 2012Date of Patent: January 28, 2014Assignee: Intel CorporationInventors: Alexander Peleg, Yaakov Yaari, Millind Mittal, Larry M. Mennemeier, Benny Eitan
-
Patent number: 8635432Abstract: There is provided an SIMD processor array system in which data can be efficiently transferred between processor elements located at different distances. The SIMD processor array system includes a control processor (CP) that is capable of issuing a plurality of instructions at the same time, and a PE array that includes a plurality of mutually-connected processing elements (PEs) to be controlled by the CP. The CP issues an inter-PE data shift instruction to each PE. According to the inter-PE data shift instruction, each PE performs a data sending operation of copying all the contents of a transfer data storing part of an adjoining PE to a transfer data storing part (MBF) of the own PE, and a data fetch operation of copying part or all of the contents of the MBF of the adjoining PE to a transfer data fetch and storing part (RBUF) of the own PE if part of the contents the MBF of the adjoining PE coincide with the contents of an ID storing part (IDB) of the own PE.Type: GrantFiled: March 4, 2009Date of Patent: January 21, 2014Assignee: NEC CorporationInventor: Shorin Kyo
-
Publication number: 20140013077Abstract: A method and apparatus for efficiently processing data in various formats in a single instruction multiple data (“SIMD”) architecture is presented. Specifically, a method to unpack a fixed-width bit values in a bit stream to a fixed width byte stream in a SIMD architecture is presented. A method to unpack variable-length byte packed values in a byte stream in a SIMD architecture is presented. A method to decompress a run length encoded compressed bit-vector in a SIMD architecture is presented. A method to return the offset of each bit set to one in a bit-vector in a SIMD architecture is presented. A method to fetch bits from a bit-vector at specified offsets relative to a base in a SIMD architecture is presented. A method to compare values stored in two SIMD registers is presented.Type: ApplicationFiled: September 10, 2013Publication date: January 9, 2014Applicant: Oracle International CorporationInventors: Amit Ganesh, Shasank K. Chavan, Vineet Marwah, Jesse Kamp, Anindya C. Patthak, Michael J. Gleeson, Allison L. Holloway, Roger Macnicol
-
Publication number: 20140013078Abstract: A method and apparatus for efficiently processing data in various formats in a single instruction multiple data (“SIMD”) architecture is presented. Specifically, a method to unpack a fixed-width bit values in a bit stream to a fixed width byte stream in a SIMD architecture is presented. A method to unpack variable-length byte packed values in a byte stream in a SIMD architecture is presented. A method to decompress a run length encoded compressed bit-vector in a SIMD architecture is presented. A method to return the offset of each bit set to one in a bit-vector in a SIMD architecture is presented. A method to fetch bits from a bit-vector at specified offsets relative to a base in a SIMD architecture is presented. A method to compare values stored in two SIMD registers is presented.Type: ApplicationFiled: September 10, 2013Publication date: January 9, 2014Applicant: Oracle International CorporationInventors: Amit Ganesh, Shasank K. Chavan, Vineet Marwah, Jesse Kamp, Anindya C. Patthak, Michael J. Gleeson, Allison L. Holloway, Roger Macnicol
-
Patent number: 8612732Abstract: One embodiment of the present invention sets forth a technique for translating application programs written using a parallel programming model for execution on multi-core graphics processing unit (GPU) for execution by general purpose central processing unit (CPU). Portions of the application program that rely on specific features of the multi-core GPU are converted by a translator for execution by a general purpose CPU. The application program is partitioned into regions of synchronization independent instructions. The instructions are classified as convergent or divergent and divergent memory references that are shared between regions are replicated. Thread loops are inserted to ensure correct sharing of memory between various threads during execution by the general purpose CPU.Type: GrantFiled: March 19, 2009Date of Patent: December 17, 2013Assignee: NVIDIA CorporationInventors: Vinod Grover, Bastiaan Joannes Matheus Aarts, Michael Murphy, Boris Beylin, Jayant B. Kolhe, Douglas Saylor
-
Patent number: 8612507Abstract: A computing device includes: a deciding unit which, in computation of values of nodes on a lattice in a direction where a value of m representing a horizontal axis coordinate of the lattice increases, decides dummy nodes to be added to m=n?1, so as to enable values of nodes on m=n to be calculated by adding the dummy nodes to m=n?1 and executing a vector operation through the use of the SIMD function by using values of nodes on m=n?1 and values of the added dummy nodes; an adding unit adding the dummy nodes decided by the deciding unit to m=n?1; and a calculating unit calculating the values of the nodes present on m=n by executing the vector operation through the use of the SIMD function by using the values of the nodes on m=n?1 and the values of the dummy nodes added by the adding unit.Type: GrantFiled: April 16, 2010Date of Patent: December 17, 2013Assignee: NS Solutions CorporationInventor: Hiroki Takeshita
-
Patent number: 8605099Abstract: A technique to increase memory bandwidth for throughput applications. In one embodiment, memory bandwidth can be increased, particularly for throughput applications, without increasing interconnect trace or pin count by pipelining pages between one or more memory storage areas on half cycles of a memory access clock.Type: GrantFiled: March 31, 2008Date of Patent: December 10, 2013Assignee: Intel CorporationInventor: Eric Sprangle
-
Patent number: 8601246Abstract: An apparatus includes an instruction decoder, first and second source registers and a circuit coupled to the decoder to receive packed data from the source registers and to unpack the packed data responsive to an unpack instruction received by the decoder. A first packed data element and a third packed data element are received from the first source register. A second packed data element and a fourth packed data element are received from the second source register. The circuit copies the packed data elements into a destination register resulting with the second packed data element adjacent to the first packed data element, the third packed data element adjacent to the second packed data element, and the fourth packed data element adjacent to the third packed data element.Type: GrantFiled: June 27, 2002Date of Patent: December 3, 2013Assignee: Intel CorporationInventors: Alexander Peleg, Yaakov Yaari, Millind Mittal, Larry M. Mennemeier, Benny Eitan
-
Publication number: 20130318324Abstract: A minicore-based reconfigurable processor and a method of flexibly processing multiple data using the same are provided. The reconfigurable processor includes minicores, each of the minicores including function units configured to perform different operations, respectively. The reconfigurable processor further includes a processing unit configured to activate two or more function units of two or more respective minicores, among the minicores, that are configured to perform an operation of a single instruction multiple data (SIMD) instruction, the processing unit further configured to execute the SIMD instruction using the activated two or more function units.Type: ApplicationFiled: February 13, 2013Publication date: November 28, 2013Applicant: Samsung Electronics Co., Ltd.Inventor: Dong-Kwan SUH
-
Patent number: 8595467Abstract: Mechanisms are provided for performing a floating point collect and operate for a summation across a vector for a dot product operation. A routing network placed before the single instruction multiple data (SIMD) unit allows the SIMD unit to perform a summation across a vector with a singe stage of adders. The routing network routes the vector elements to the adders in a first cycle. The SIMD unit stores the results of the adders into a results vector register. The routing network routes the summation results from the results vector register to the adders in a second cycle. The SIMD unit then stores the results from the second cycle in the results vector register.Type: GrantFiled: December 29, 2009Date of Patent: November 26, 2013Assignee: International Business Machines CorporationInventors: Brian K. Flachs, Seiji Maeda, Steven Osman
-
Integrated circuit device, electronic apparatus and method for manufacturing of electronic apparatus
Patent number: 8587568Abstract: An integrated circuit device includes a host I/F, an information register, and a control section. The information register stores wave selection information for selecting waveform information which defines a waveform of a drive signal of the electro-optical device. Waveform information selected by the wave selection information stored in the information register from among a plurality of pieces of waveform information is loaded to an information memory at the time of manufacturing an electronic apparatus including the electro-optical device. The control section controls the display of the electro-optical device on the basis of the waveform information read from the information memory at the time of an actual operation of the electronic apparatus.Type: GrantFiled: March 24, 2010Date of Patent: November 19, 2013Assignee: Seiko Epson CorporationInventor: Hideki Ogawa -
Patent number: 8578387Abstract: An embodiment of a computing system is configured to process data using a multithreaded SIMD architecture that includes heterogeneous processing engines to execute a program. The program is constructed of various program instructions. A first type of the program instructions can only be executed by a first type of processing engine and a third type of program instructions can only be executed by a second type of processing engine. A second type of program instructions can be executed by the first and the second type of processing engines. An assignment unit may be configured to dynamically determine which of the two processing engines executes any program instructions of the second type in order to balance the workload between the heterogeneous processing engines.Type: GrantFiled: July 31, 2007Date of Patent: November 5, 2013Assignee: Nvidia CorporationInventors: Peter C. Mills, Stuart F. Oberman, John Erik Lindholm, Samuel Liu
-
Patent number: 8572355Abstract: One embodiment of the present invention sets forth a method for executing a non-local return instruction in a parallel thread processor. The method comprises the steps of receiving, within the thread group, a first long jump instruction and, in response, popping a first token from the execution stack. The method also comprises determining whether the first token is a first long jump token that was pushed onto the execution stack when a first push instruction associated with the first long jump instruction was executed, and when the first token is the first long jump token, jumping to the second instruction based on the address specified by the first long jump token, or, when the first token is not the first long jump token, disabling the active thread until the first long jump token is popped from the execution stack.Type: GrantFiled: September 13, 2010Date of Patent: October 29, 2013Assignee: Nvidia CorporationInventors: Guillermo Juan Rozas, Brett W. Coon
-
Patent number: 8560809Abstract: According to some embodiments, a technique provides for the execution of an instruction that includes receiving residual data of a first image and decoded pixels of a second image, zero-extending a plurality of unsigned data operands of the decoded pixels producing a plurality of unpacked data operands, adding a plurality of signed data operands of the residual data to the plurality of unpacked data operands producing a plurality of signed results; and saturating the plurality of signed results producing a plurality of unsigned results.Type: GrantFiled: November 15, 2011Date of Patent: October 15, 2013Assignee: Intel CorporationInventors: Bradley C. Aldrich, Nigel C. Paver, Murli Ganeshan
-
Patent number: 8539202Abstract: A method includes, in a processor, loading/moving a first portion of bits of a source into a first portion of a destination register and duplicate that first portion of bits in a subsequent portion of the destination register.Type: GrantFiled: June 12, 2012Date of Patent: September 17, 2013Assignee: Intel CorporationInventor: Patrice Roussel
-
Patent number: 8539201Abstract: Systems, methods and articles of manufacture are disclosed for transposing array data on a SIMD multi-core processor architecture. A matrix in a SIMD format may be received. The matrix may comprise a SIMD conversion of a matrix M in a conventional data format. A mapping may be defined from each element of the matrix to an element of a SIMD conversion of a transpose of matrix M. A SIMD-transposed matrix T may be generated based on matrix M and the defined mapping. A row-wise algorithm may be applied to T, without modification, to operate on columns of matrix M.Type: GrantFiled: November 4, 2009Date of Patent: September 17, 2013Assignee: International Business Machines CorporationInventors: Jeffrey S. McAllister, Timothy J. Mullins, Nelson Ramirez, Mark A. Bransford
-
Patent number: 8532288Abstract: A cryptographic engine for modulo N multiplication, which is structured as a plurality of almost identical, serially connected Processing Elements, is controlled so as to accept input in blocks that are smaller than the maximum capability of the engine in terms of bits multiplied at one time. The serially connected hardware is thus partitioned on the fly to process a variety of cryptographic key sizes while still maintaining all of the hardware in an active processing state.Type: GrantFiled: December 1, 2006Date of Patent: September 10, 2013Assignee: International Business Machines CorporationInventors: Camil Fayad, John K. Li, Siegfried K. H. Sutter, Phil C. Yeh
-
Publication number: 20130227249Abstract: A three-dimensional (3D) permute unit for a single-instruction-multiple-data stacked processor includes a first vector permute subunit and a second vector permute subunit. The first and second vector permute subunits are arranged in different layers of a 3D chip package. The vector permute subunits are each configured to process a portion of at least two input vectors. A first contact sub-field of the first vector permute subunit is configured to connect output ports of a first crossbar of the first vector permute subunit, holding an intermediate result of the first vector permute subunit, to a second contact sub-field of the second vector permute subunit. A first contact sub-field of the second vector permute subunit is configured to connect output ports of a first crossbar of the second vector permute subunit, holding an intermediate result of the second vector permute subunit, to a second contact sub-field of the first vector permute subunit.Type: ApplicationFiled: February 25, 2013Publication date: August 29, 2013Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATIONInventor: INTERNATIONAL BUSINESS MACHINES CORPORATION
-
Patent number: 8521994Abstract: An apparatus includes an instruction decoder, first and second source registers and a circuit coupled to the decoder to receive packed data from the source registers and to pack the packed data responsive to a pack instruction received by the decoder. A first packed data element and a second packed data element are received from the first source register. A third packed data element and a fourth packed data element are received from the second source register. The circuit packs packing a portion of each of the packed data elements into a destination register resulting with the portion from second packed data element adjacent to the portion from the first packed data element, and the portion from the fourth packed data element adjacent to the portion from the third packed data element.Type: GrantFiled: December 22, 2010Date of Patent: August 27, 2013Assignee: Intel CorporationInventors: Alexander Peleg, Yaakov Yaari, Millind Mittal, Larry M. Mennemeier, Benny Eitan
-
Patent number: 8510536Abstract: Techniques for vector completion mask (VCM) handling are provided. A data structure includes a mask field for each operand of a particular operation. A processor attempts to execute the operation with multiple operands, which are identified in the data structure by the mask fields. If operands are successfully retrieved for execution with the operation, then the corresponding mask field within the data structure is cleared. The processor can reset if any field remains set within the data structure and can re-process the operation with operands that were not previously handled with the operation.Type: GrantFiled: June 28, 2012Date of Patent: August 13, 2013Assignee: Intel CorporationInventors: Stephan Jourdan, Michael Fetterman, Michael Cornaby, Per Hammarlund, Ronak Signhal, Glenn Hinton
-
Patent number: 8493979Abstract: Executing a single instruction/multiple data (SIMD) instruction of a program to process a vector of data wherein each element of the packet vector corresponds to a different received packet.Type: GrantFiled: December 30, 2008Date of Patent: July 23, 2013Assignee: Intel CorporationInventors: Bryan E. Veal, Travis T. Schluessler
-
Patent number: 8495253Abstract: An article of manufacture, apparatus, and a method for facilitating input/output (I/O) processing for an I/O operation at a host computer system configured for communication with a control unit. The method includes the host computer system obtaining a transport command word (TCW) for an I/O operation having both input and output data. The TCW specifies a location of the output data and a location for storing the input data. The host computer system forwards the I/O operation to the control unit for execution. The host computer system gathers the output data responsive to the location of the output data specified by the TCW, and then forwards the output data to the control unit for use in the execution of the I/O operation. The host computer system receives the input data from the control unit and stores the input data at the location specified by the TCW.Type: GrantFiled: March 30, 2011Date of Patent: July 23, 2013Assignee: International Business Machines CorporationInventors: John R. Flanagan, Daniel F. Casper, Catherine C. Huang, Matthew J. Kalos, Ugochukwu C. Njoku, Dale F. Riedy, Gustav E. Sittmann
-
Publication number: 20130185538Abstract: A processor includes a scalar processor core and a vector coprocessor core coupled to the scalar processor core. The scalar processor core is configured to retrieve an instruction stream from program storage. The instruction stream includes scalar instructions executable by the scalar processor core and vector instructions executable by the vector coprocessor core. The scalar processor core is configured to pass the vector instructions to the vector coprocessor core. The vector coprocessor core configured to process a plurality of data values in parallel while executing each vector instruction passed by the scalar processor core. The vector coprocessor core includes a plurality of processing paths arranged in parallel to process the data values. Each of the processing paths includes an execution unit. Each of the execution units is configured to communicate a result of processing to each other of the execution units.Type: ApplicationFiled: July 13, 2012Publication date: July 18, 2013Applicant: TEXAS INSTRUMENTS INCORPORATEDInventors: Ching-Yu Hung, Shinri Inamori, Jagadeesh Sankaran, Peter Chang
-
Patent number: 8478969Abstract: In one embodiment, the present invention includes a processor having multiple execution units, at least one of which includes a circuit having a multiply-accumulate (MAC) unit including multiple multipliers and adders, and to execute a user-level multiply-multiply-accumulate instruction to populate a destination storage with a plurality of elements each corresponding to an absolute value for a pixel of a pixel block. Other embodiments are described and claimed.Type: GrantFiled: September 24, 2010Date of Patent: July 2, 2013Assignee: Intel CorporationInventor: Eric S. Sprangle
-
Publication number: 20130166878Abstract: Operation parallelism of a data processor is enhanced by floating-point inner product execution units compatible with single instruction multiple data (SIMD). An operating system that can significantly enhance the level of operation parallelism per instruction while maintaining efficiency of floating-point length-4 vector inner product execution units is implemented. The floating-point length-4 vector inner product execution units are defined in the minimum width (32 bits for single precision) even where an extensive operating system becomes available, and compose the inner product execution units to be compatible with SIMD. The mutually augmenting effects of the inner product execution units and SIMD-compatible composition enhances the level of operation parallelism dramatically.Type: ApplicationFiled: November 28, 2012Publication date: June 27, 2013Applicant: RENESAS ELECTRONICS CORPORATIONInventor: Renesas Electronics Corporation
-
Patent number: 8464025Abstract: A signal processing apparatus able to raise a processing capability in processing accompanying access to a storing means is provided. Stream control units (SCU) 203—0 to 203—3 access data at an external memory system or local memories 204—0 to 204—3 according to a thread under control from a host processor. Processor units (PU) arrays 202—0 to 202—3 perform image processing by a different thread from the thread of the SCUs 203—0 to 203—3.Type: GrantFiled: May 22, 2006Date of Patent: June 11, 2013Assignee: Sony CorporationInventors: Yuji Yamaguchi, Masatoshi Imai, Toshiharu Noda, Naosuke Asari, Tomoo Mitsunaga, Mitsuharu Ohki, Kazumasa Ito, Hidetoshi Nagano, Sumito Arakawa, Kei Ito
-
Publication number: 20130145120Abstract: Method, apparatus, and program means for performing bitstream buffer manipulation with a SIMD merge instruction. The method of one embodiment comprises determining whether any unprocessed data bits for a partial variable length symbol exist in a first data block is made. A shift merge operation is performed to merge the unprocessed data bits from the first data block with a second data block. A merged data block is formed. A merged variable length symbol comprised of the unprocessed data bits and a plurality of data bits from the second data block is extracted from the merged data block.Type: ApplicationFiled: January 29, 2013Publication date: June 6, 2013Inventors: Yen-Kueng Chen, William W. Macy, Jr., Matthew Holliman, Eric L. Debes, Minerva M. Yeung
-
Patent number: 8458442Abstract: A structure (and method) including a plurality of coprocessing units and a controller that selectively loads data for processing on the plurality of coprocessing units, using a compound loading instruction. The compound loading instruction includes a plurality of low-level software instructions that preliminarily processes input data in a manner predetermined to simulate an effect of a single hardware loading instruction that would provide optimal loading of complex matrix data by loading input data in accordance with the effect of multiplying i·i=?1.Type: GrantFiled: August 26, 2009Date of Patent: June 4, 2013Assignee: International Business Machines CorporationInventors: Alexandre E. Eichenberger, Michael Karl Gschwind, John A. Gunnels, Fred Gehrung Gustavson, Brett Olsson
-
Publication number: 20130138917Abstract: Method, apparatus, and program means for performing bitstream buffer manipulation with a SIMD merge instruction. The method of one embodiment comprises determining whether any unprocessed data bits for a partial variable length symbol exist in a first data block is made. A shift merge operation is performed to merge the unprocessed data bits from the first data block with a second data block. A merged data block is formed. A merged variable length symbol comprised of the unprocessed data bits and a plurality of data bits from the second data block is extracted from the merged data block.Type: ApplicationFiled: January 29, 2013Publication date: May 30, 2013Inventors: Yen-Kueng Chen, William W. Macy, JR., Matthew Holliman, Eric L. Debes, Minerva M. Yeung
-
Patent number: 8447953Abstract: A microprocessor architecture comprises a plurality of processing elements arranged in a single instruction multiple data SIMD array, wherein each processing element includes a plurality of execution units, each of which is operable to process an instruction of a particular instruction type, a serial processor which includes a plurality of execution units, each of which is operable to process an instruction of a particular instruction type, and an instruction controller operable to receive a plurality of instructions, and to distribute received instructions to the execution units in dependence upon the instruction types of the received instruction. The execution units of the serial processor are operable to process respective instructions in parallel.Type: GrantFiled: February 7, 2006Date of Patent: May 21, 2013Assignee: Rambus Inc.Inventor: Leon David Wildman
-
Publication number: 20130124824Abstract: Method, apparatus, and program means for performing bitstream buffer manipulation with a SIMD merge instruction. The method of one embodiment comprises determining whether any unprocessed data bits for a partial variable length symbol exist in a first data block is made. A shift merge operation is performed to merge the unprocessed data bits from the first data block with a second data block. A merged data block is formed. A merged variable length symbol comprised of the unprocessed data bits and a plurality of data bits from the second data block is extracted from the merged data block.Type: ApplicationFiled: December 27, 2012Publication date: May 16, 2013Inventors: Yen-Kueng Chen, William W. Macy, Matthew Holliman, Eric L. Debes, Minerva M. Yeung